Gerrit Code Review ESC meeting minutes published

114 views
Skip to first unread message

lucamilanesio

unread,
Mar 15, 2021, 5:32:56 PM3/15/21
to Repo and Gerrit Discussion
The Gerrit Code Review ESC meeting minutes have been published at:

Luca.

Sven Selberg

unread,
Mar 16, 2021, 4:35:22 AM3/16/21
to Repo and Gerrit Discussion
On Monday, March 15, 2021 at 10:32:56 PM UTC+1 lucamilanesio wrote:
The Gerrit Code Review ESC meeting minutes have been published at:

Regarding "Gerrit events rewrite and GCloud pub/sub notifications".
Where can I find a design document for this overhaul of the events system?

/Sven
 


Luca.

Han-Wen Nienhuys

unread,
Mar 16, 2021, 6:31:56 AM3/16/21
to Sven Selberg, Repo and Gerrit Discussion
The phrasing " beginning of the initiative of having Gerrit notes
exported as events" makes it sound as if there is a big initiative
starting, but the change linked as well as
https://gerrit-review.googlesource.com/c/gerrit/+/298962 is all there
is to it.

googlesource.com is a deployment that is very light on
backend-plugins, and we always ask our customers to go through the
REST API instead, which this mechanism provides. There is no further
overhaul planned, at least not from our side.

--
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Sven Selberg

unread,
Mar 16, 2021, 7:05:45 AM3/16/21
to Repo and Gerrit Discussion
On Tuesday, March 16, 2021 at 11:31:56 AM UTC+1 han...@google.com wrote:
On Tue, Mar 16, 2021 at 9:35 AM Sven Selberg <sven.s...@axis.com> wrote:
>
>
>
> On Monday, March 15, 2021 at 10:32:56 PM UTC+1 lucamilanesio wrote:
>>
>> The Gerrit Code Review ESC meeting minutes have been published at:
>> https://www.gerritcodereview.com/2021-03-09-esc-minutes.html
>
>
> Regarding "Gerrit events rewrite and GCloud pub/sub notifications".
> Where can I find a design document for this overhaul of the events system?

The phrasing " beginning of the initiative of having Gerrit notes
exported as events" makes it sound as if there is a big initiative
starting, but the change linked as well as
https://gerrit-review.googlesource.com/c/gerrit/+/298962 is all there
is to it.

googlesource.com is a deployment that is very light on
backend-plugins, and we always ask our customers to go through the
REST API instead, which this mechanism provides. There is no further
overhaul planned, at least not from our side.

The change was advertised under the header:
"Gerrit events rewrite and GCloud pub/sub notifications" [1]

This together with the commit message [2]:
  """...
   This is part of an alternate approach to event notifications, where
   notification is split into two parts:
   1) notifying a client that *something* changed
   2) letting the client discover precisely *what* changed.
  ..."""

And this post [3] by Luca in january:
"""Our 2021 shopping list has a common these: making Gerrit more cloud-native.

1. Join the Gerrit v3.4 effort to rebuild the event system, based on protbuf and generic event brokers
2. Integrated with cloud-native event brokers: Google Cloud PubSub and AWS Kinesis"""

To me at least, hints that there is some sort of rewrite of the events that is planned or in progress.
And if there exists no such initiative I'm a bit confused about these statements.

BR
Sven

Luca Milanesio

unread,
Mar 16, 2021, 7:10:40 AM3/16/21
to Repo and Gerrit Discussion, Luca Milanesio, Sven Selberg
Talking about [3] that is more a GerritForge’s target for 2021: we are indeed planning to release the GCloud PubSub and AWS Kinesis events plugins for Gerrit v3.4.
They will be going side-by-side with the current ones for Kafka and RabbitMQ.


To me at least, hints that there is some sort of rewrite of the events that is planned or in progress.
And if there exists no such initiative I'm a bit confused about these statements.

I believe there is the plan, but we are behind for v3.4 and as Han-Wen mentioned there is only one (merged) change in Gerrit.

With regards to the design document, there was one written by Alice which is now largely obsolete anyway.
That’s the reason why it was abandoned.

Luca.


BR
Sven



--
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--
Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/243f7e17-cedb-42ea-a4f9-c9d43d2c75dfn%40googlegroups.com.

Han-Wen Nienhuys

unread,
Mar 16, 2021, 7:12:54 AM3/16/21
to Sven Selberg, Repo and Gerrit Discussion
We had internal interest at Google for event notification mid last
year, so we thought we'd try to see if it was a good moment to rework
the existing internal event system, and work in concert with Luca's
plans. Upon closer inspection, introducing a /meta_diff endpoint was a
much simpler mechanism to get what we wanted.

lucamilanesio

unread,
Mar 16, 2021, 7:32:07 AM3/16/21
to Repo and Gerrit Discussion
On Tuesday, March 16, 2021 at 11:12:54 AM UTC han...@google.com wrote:
On Tue, Mar 16, 2021 at 12:05 PM Sven Selberg <sven.s...@axis.com> wrote:
>> >> The Gerrit Code Review ESC meeting minutes have been published at:
>> >> https://www.gerritcodereview.com/2021-03-09-esc-minutes.html
>> >
>> >
>> > Regarding "Gerrit events rewrite and GCloud pub/sub notifications".
>> > Where can I find a design document for this overhaul of the events system?
>>
>> The phrasing " beginning of the initiative of having Gerrit notes
>> exported as events" makes it sound as if there is a big initiative
>> starting, but the change linked as well as
>> https://gerrit-review.googlesource.com/c/gerrit/+/298962 is all there
>> is to it.
>>
>> googlesource.com is a deployment that is very light on
>> backend-plugins, and we always ask our customers to go through the
>> REST API instead, which this mechanism provides. There is no further
>> overhaul planned, at least not from our side.

@Sven form our side (GerritForge) we continue to pursue the redesign of the Gerrit event system, but not on v3.4 where our intervention is limited to the two extra plugins to notify events on AWS and GCloud.

We are targeting potentially v3.5, with the idea to improve the event system, introducing protobuf for the events format, introduce a schema, etc....
As GerritForge we are not ready yet to publish a design document, but we'll do for sure in the next forthcoming weeks/months.

Hope this clarifies.

Luca.

Sven Selberg

unread,
Mar 16, 2021, 7:39:02 AM3/16/21
to Repo and Gerrit Discussion
IIUC Han-wen's change is not related to any event-rewrite.
 

With regards to the design document, there was one written by Alice which is now largely obsolete anyway.
That’s the reason why it was abandoned.

Why I'm really asking is that:
* I would be most interested in taking part in design-discussions and implementation of a new event system.
* I like Han-wen's idea of a "something happened" trigger, at least for Change-events.
   Potentially this could simplify the event pipeline and make all current Change-events triggers superfluous.
   At the end of the "potential" future event pipe there would be a ChangeEvent-generator that maps
   "something-happened" events into Change-Events.

/Sven

Luca Milanesio

unread,
Mar 16, 2021, 7:46:58 AM3/16/21
to Repo and Gerrit Discussion, Luca Milanesio
We don’t have a design document yet, so at the moment it is the only thing we’ve got.

 

With regards to the design document, there was one written by Alice which is now largely obsolete anyway.
That’s the reason why it was abandoned.

Why I'm really asking is that:
* I would be most interested in taking part in design-discussions and implementation of a new event system.

+1, will make sure that you’ll be in the list of reviewers

* I like Han-wen's idea of a "something happened" trigger, at least for Change-events.
   Potentially this could simplify the event pipeline and make all current Change-events triggers superfluous.
   At the end of the "potential" future event pipe there would be a ChangeEvent-generator that maps
   "something-happened" events into Change-Events.

Yep.

Also, I’ve added an extra sentence to clarify that Gerrit v3.4 isn’t the target for the events rewrite:

HTH

Luca.

lucamilanesio

unread,
Mar 16, 2021, 8:00:14 AM3/16/21
to Repo and Gerrit Discussion
I have also created an issue for tracking the need of creating a design document:

Luca.

Sven Selberg

unread,
Mar 16, 2021, 8:02:51 AM3/16/21
to Repo and Gerrit Discussion
Thank you Han-wen and Luca for taking the time to explain the situation to me.

/Sven 

Luca Milanesio

unread,
Mar 17, 2021, 7:15:57 AM3/17/21
to Han-Wen Nienhuys, Luca Milanesio, Sven Selberg, Repo and Gerrit Discussion
Would Google still be interested in our initiative for Gerrit master and targeting v3.5, from a design-review and code-review perspective?

Luca.

> Upon closer inspection, introducing a /meta_diff endpoint was a
> much simpler mechanism to get what we wanted.
>
>> [1] https://www.gerritcodereview.com/2021-03-09-esc-minutes.html#gerrit-events-rewrite-and-gcloud-pubsub-notifications
>> [2] https://gerrit-review.googlesource.com/c/gerrit/+/296326/19//COMMIT_MSG#27
>> [3] https://groups.google.com/g/repo-discuss/c/SSxMzG2G3oU
>
> --
> Han-Wen Nienhuys - Google Munich
> I work 80%. Don't expect answers from me on Fridays.
> --
>
> Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich
>
> Registergericht und -nummer: Hamburg, HRB 86891
>
> Sitz der Gesellschaft: Hamburg
>
> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado
>
> --
> --
> To unsubscribe, email repo-discuss...@googlegroups.com
> More info at http://groups.google.com/group/repo-discuss?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/CAFQ2z_MUTMViAOYES0P17Pt%3DZ_OfXkX7mcW4EA9OQ3h8Stp-OA%40mail.gmail.com.

Han-Wen Nienhuys

unread,
Mar 17, 2021, 8:01:44 AM3/17/21
to Luca Milanesio, Sven Selberg, Repo and Gerrit Discussion, Patrick Hiesel
On Wed, Mar 17, 2021 at 12:15 PM Luca Milanesio
<luca.mi...@gmail.com> wrote:
> >> And this post [3] by Luca in january:
> >> """Our 2021 shopping list has a common these: making Gerrit more cloud-native.
> >>
> >> 1. Join the Gerrit v3.4 effort to rebuild the event system, based on protbuf and generic event brokers
> >> 2. Integrated with cloud-native event brokers: Google Cloud PubSub and AWS Kinesis"""
> >>
> >> To me at least, hints that there is some sort of rewrite of the events that is planned or in progress.
> >> And if there exists no such initiative I'm a bit confused about these statements.
> >
> > We had internal interest at Google for event notification mid last
> > year, so we thought we'd try to see if it was a good moment to rework
> > the existing internal event system, and work in concert with Luca's
> > plans.
>
> Would Google still be interested in our initiative for Gerrit master and targeting v3.5, from a design-review and code-review perspective?

it depends on what precisely the initiative is :)

* We do want to migrate data types of external APIs to protobuf.
* We don't have a deep interest in cloud-native event brokers (as we
already have our own); it would be nice if newer events were
compatible with the GCP one I alluded to before, though.
* We don't use plugins extensively today, nor do we plan to. For that
reason, we don't care deeply about changes to the event system as it
exists inside Gerrit to talk to plugins, but we want to be in the
loop, because we potentially have to rewrite internal plugins if
anything changes.

I think the biggest weakness of the current event internal system is that

* it introduces a whole separate set of data types (which are out of
date, cf. attentionset).
* Gerrit has to provide information upfront, without knowing if the
data is ever going to be used (we compute diffs, for example)

if the internal event system were also based on ref update
notifications + meta_diff, that might be an improvement.

I'm adding Patrick in case he has further thoughts.

Maybe this discussion is better suited to the next ESC meeting?

Sven Selberg

unread,
Mar 17, 2021, 8:31:35 AM3/17/21
to Repo and Gerrit Discussion
I'm not claiming to have the entire picture, but, I suspect that an event
system were the event just says "something happened" could potentially
end up being more expensive for the server.
If you have 10 consumers of this event-stream you would get
10 * nbr_of_events calls to the meta-diff REST endpoint.
I would estimate that we have a fairly middle-of-the-road instance when
it comes to load and we have ~150.000 change-events every 24 hours.
That would mean an extra 1.5M calls to the REST API per day just so that
the consumers could find out if they should act on the event or not.

And that's not counting the extra computational effort from the event-stream
consumers that previously could filter events based on a local string comparison.

You would off course negate many of these downsides if you populated the
"something-happened" event with some cheap properties like [type, project, branch],
__which, now that I think about it, you haven't claimed that you wont. __
 
/Sven

Patrick Hiesel

unread,
Mar 17, 2021, 8:52:04 AM3/17/21
to Sven Selberg, Repo and Gerrit Discussion
We discussed that during our internal design as well. For our gs.com instance, the consensus was that these CI systems already call us today to do polling, so the additional traffic seems OK. If it turns out not to be the case we'd likely start shipping some information in the event as well. This doesn't apply to you, though, because the existing event system already ships information with events, so it would be a regression to stop that.

I am with Han-Wen as far as our interest goes and I think they are well aligned the the community interests:
- use protobufs for (REST) api types
- share these with the event system to have common entities

for the external event system, I think it makes sense to build a central dispatcher that gets only sha1s and then generates events (also rich events carrying payload). that has advantages:
- not all parts of the gerrit code that do something (say adding topics) need to think of dispatching events
- this component can be placed on another machine, so it scales easier and takes load off the serving machine(s)

Happy to discuss in the ESC what our involvement with reviews etc can look like.

Thanks,
Patrick
 
And that's not counting the extra computational effort from the event-stream
consumers that previously could filter events based on a local string comparison.

You would off course negate many of these downsides if you populated the
"something-happened" event with some cheap properties like [type, project, branch],
__which, now that I think about it, you haven't claimed that you wont. __
 
/Sven



if the internal event system were also based on ref update
notifications + meta_diff, that might be an improvement.

I'm adding Patrick in case he has further thoughts.

Maybe this discussion is better suited to the next ESC meeting?
--
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.

Han-Wen Nienhuys

unread,
Mar 17, 2021, 10:00:36 AM3/17/21
to Sven Selberg, Repo and Gerrit Discussion
Is that a lot? The events always happen in response to writes; the
writes are in response to reviews either by humans or systems. A
single user that does a review certainly needs more than 10 requests
to load a change into their browser.

> And that's not counting the extra computational effort from the event-stream
> consumers that previously could filter events based on a local string comparison.
>
> You would off course negate many of these downsides if you populated the
> "something-happened" event with some cheap properties like [type, project, branch],
> __which, now that I think about it, you haven't claimed that you wont. __

"Something happened" is "project, branch". It doesn't include 'type',
because you'd have to wire through Gerrit's idea of the world to the
ref storage.

Sven Selberg

unread,
Mar 17, 2021, 10:09:53 AM3/17/21
to Repo and Gerrit Discussion
I'd say no,  that's not a lot when put into perspective.

Luca Milanesio

unread,
Mar 17, 2021, 10:21:33 AM3/17/21
to Repo and Gerrit Discussion, Luca Milanesio, Sven Selberg, Patrick Hiesel
Yep, those two goal are definitely the first and most urgent ones ones to address.

With regards to gs.com, you have addressed them thanks to Spanner and the publish to PubSub: I am keen with GerritForge to bring that to the rest of the community, regardless of what infrastructure they use.
I’ll put a discussion point with an early design document for review at the next ESC.

Luca.

Martin Fick

unread,
Mar 17, 2021, 12:47:23 PM3/17/21
to repo-d...@googlegroups.com, Sven Selberg
> > > system were the event just says "something happened" could potentially
> > > end up being more expensive for the server.
> > > If you have 10 consumers of this event-stream you would get
> > > 10 * nbr_of_events calls to the meta-diff REST endpoint.
> > > I would estimate that we have a fairly middle-of-the-road instance when
> > > it comes to load and we have ~150.000 change-events every 24 hours.
> > > That would mean an extra 1.5M calls to the REST API per day just so that
> > > the consumers could find out if they should act on the event or not.
> >
> > Is that a lot? The events always happen in response to writes; the
> > writes are in response to reviews either by humans or systems. A
> > single user that does a review certainly needs more than 10 requests
> > to load a change into their browser.
>
> I'd say no, that's not a lot when put into perspective.

I am concerned that it actually might be a lot, however I think the important
question is not whether it is a lot (#1), but rather whether it scales well or
not (#2).

But to answer the "a lot" part (#1), it very well could be. I think your 10
requests per review assumes that events are only created by a human reviewer.
CI systems may alter the ratios significantly as they may post many more
reviews than humans and require very little server load to do so. Of course,
the potential difference in this ratio and the potential difference in the
ratio of event consumers to producers (reviewers), leads to the question of
scalability (#2). So while you describe what may be a typical scenario, if we
want Gerrit to be scalable, we need it to be able to handle not just typical
use cases, but also extreme ones.

In extreme cases, their may be hundreds or thousands of consumers of events,
and having basic information such as a change's project and branch can be
essential to creating a scalable system. Since adding that info into the
original event is cheap and it allows for greater scalablility, that is likely
why that info was there in the first place. Removing that information seems
like a potential regression and a move towards a less scalable system than the
current one.

A third point which may also be overlooked, is latency. If a consumer has to
turn around and make another request to the server, that could add unwanted
latency for consumers which are across the globe. I'm not sure if that matters
to anyone?

-Martin

--
The Qualcomm Innovation Center, Inc. is a member of Code
Aurora Forum, hosted by The Linux Foundation

Luca Milanesio

unread,
Mar 17, 2021, 5:40:05 PM3/17/21
to Repo and Gerrit Discussion, Luca Milanesio, Sven Selberg, Martin Fick
Your last point is very important for us, because we are multi-site and the current architecture is an AP system, which is missing the ‘C’ part of the global consistency.
That means that when the CI system receives the event, it may NOT know from which instance it was generated and therefore would have no idea *which* Gerrit’s REST-API to call.

If you just call *any Gerrit* you may actually get a 404, because the change hasn’t been propagated yet.

This is actually happening *also* on *.gs.com and we can clearly see that from the Gerrit-CI builds: at times, even if the Checks API are reporting that a change needs verification, the associated REST-API to Gerrit are failing because the change hasn’t been replicated yet to that node.

Having instead the payload *inside* the notification message, or at least the most important parts of It that are needed for the CI to do his job, it would allow an AP system to work properly.

Anyway, we do need a design document where we put use-cases in writing and discuss organically on them :-)

Luca.

>
> -Martin
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code
> Aurora Forum, hosted by The Linux Foundation
>
> --
> --
> To unsubscribe, email repo-discuss...@googlegroups.com
> More info at http://groups.google.com/group/repo-discuss?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/1667594.b6kNENyT7j%40mfick-lnx.

Han-Wen Nienhuys

unread,
Mar 18, 2021, 5:14:01 AM3/18/21
to Luca Milanesio, Repo and Gerrit Discussion, Sven Selberg, Martin Fick
If you have many subscribers, you have to introduce a proxy that
computes the "what" of the notification, and then sends it out to many
subscribers. In that case, the proxy has to decide what information to
include, and what to omit.

> > In extreme cases, their may be hundreds or thousands of consumers of events,
> > and having basic information such as a change's project and branch can be
> > essential to creating a scalable system. Since adding that info into the
> > original event is cheap and it allows for greater scalablility, that is likely
> > why that info was there in the first place.

Not necessarily. The PatchsetCreated event contains a sizeInsertions
field. Computing it requires doing a diff, which is expensive for
large changes.

> > Removing that information seems
> > like a potential regression and a move towards a less scalable system than the
> > current one.
> >
> > A third point which may also be overlooked, is latency. If a consumer has to
> > turn around and make another request to the server, that could add unwanted
> > latency for consumers which are across the globe. I'm not sure if that matters
> > to anyone?
>
> Your last point is very important for us, because we are multi-site and the current architecture is an AP system, which is missing the ‘C’ part of the global consistency.
> That means that when the CI system receives the event, it may NOT know from which instance it was generated and therefore would have no idea *which* Gerrit’s REST-API to call.
>
> If you just call *any Gerrit* you may actually get a 404, because the change hasn’t been propagated yet.
>
> This is actually happening *also* on *.gs.com and we can clearly see that from the Gerrit-CI builds: at times, even if the Checks API are reporting that a change needs verification, the associated REST-API to Gerrit are failing because the change hasn’t been replicated yet to that node.
>
> Having instead the payload *inside* the notification message, or at least the most important parts of It that are needed for the CI to do his job, it would allow an AP system to work properly.

No, not in general. If the payload is inside the notification message,
the system receiving it will may issue follow-on action. That
follow-on may also connect to the wrong mirror, leading to failure or
unexpected outcomes.

The meta_diff design actually factors in this case: it will return 412
if the prerequisite update wasn't seen by the mirror. Follow-on
actions are still susceptible to lag. I have a follow-on design that
will offer a solution to this problem.

Sven Selberg

unread,
Mar 18, 2021, 5:28:55 AM3/18/21
to Repo and Gerrit Discussion
Just to be clear. I'm rather found of the "something happened" event and want
it to work for vanilla Gerrit and the general case/consumer.

In our case (with the current architecture) we wouldn't even notice if we got
thousands of new event-consumers since there's, in many cases, no need
for a round-trips and we offload the event publishing to an MQ.


>
> In extreme cases, their may be hundreds or thousands of consumers of events,
> and having basic information such as a change's project and branch can be
> essential to creating a scalable system. Since adding that info into the
> original event is cheap and it allows for greater scalablility, that is likely
> why that info was there in the first place. Removing that information seems
> like a potential regression and a move towards a less scalable system than the
> current one.
>
> A third point which may also be overlooked, is latency. If a consumer has to
> turn around and make another request to the server, that could add unwanted
> latency for consumers which are across the globe. I'm not sure if that matters
> to anyone?


Your last point is very important for us, because we are multi-site and the current architecture is an AP system, which is missing the ‘C’ part of the global consistency.
That means that when the CI system receives the event, it may NOT know from which instance it was generated and therefore would have no idea *which* Gerrit’s REST-API to call.

If you just call *any Gerrit* you may actually get a 404, because the change hasn’t been propagated yet.

This is actually happening *also* on *.gs.com and we can clearly see that from the Gerrit-CI builds: at times, even if the Checks API are reporting that a change needs verification, the associated REST-API to Gerrit are failing because the change hasn’t been replicated yet to that node.

We have managed to trigger similar behavior with a single-primary setup.
1. CI system reacts on event from RabbitMQ
2. a micro-service (with limited data) tries to GET the change with a revision SHA1 of latest patchset.
This round-trip was fast enough so that the index wasn't updated yet so Gerrit couldn't identify the change with the commit sha1 and returned 404.

So the CI resiliency problem is not unique to the push-poll approach of "something happened" and/or multi-primary systems.
 


Having instead the payload *inside* the notification message, or at least the most important parts of It that are needed for the CI to do his job, it would allow an AP system to work properly.

Anyway, we do need a design document where we put use-cases in writing and discuss organically on them :-)

Wouldn't an online (3 hour or so) conference to vent thoughts and ideas for the event rewrite be a good thing?

I have some ideas that are extremely simplified and that I haven't thought through:

* Write some sort of framework that can parse "something happened" events that can be used by event publishing plugins.
* Rip out the current "event-structure" (com.google.gerrit.server.events, com.google.gerrit.server.data, ?) from core into a lib (legacy-events)
* Remove stream-events ssh command from core Gerrit
* Write a plugin ("legacy-events") that uses the "something happened" framework and maps this to the the legacy-events lib to construct legacy events and implement a stream-events command equivalent.
* The legacy-events lib could also be used by other event-publishing plugins that are interested in keeping backwards compatibility.

Luca Milanesio

unread,
Mar 18, 2021, 5:29:24 AM3/18/21
to Repo and Gerrit Discussion, Luca Milanesio
The way the CI plugin currently resolves the issue is to wait for the subsequent replication event notification that confirms that the change has now been replicated to the target node and therefore it is safe to start the build.
We do not have replication events though on *.gs.com so we don’t have a solution yet there.

> That
> follow-on may also connect to the wrong mirror, leading to failure or
> unexpected outcomes.

This is what currently happens on Gerrit-CI: the problem is that the build fails, the contributor/maintainer isn’t happy because he feels to be let down by the CI validation, the Checks API do not return that change anymore because it has been already “consumed” and the build was in progress.

>
> The meta_diff design actually factors in this case: it will return 412
> if the prerequisite update wasn't seen by the mirror. Follow-on
> actions are still susceptible to lag. I have a follow-on design that
> will offer a solution to this problem.

Sounds good, thanks and looking forward to it, so that we can fix the Gerrit-CI trigger and build validation on that.
Zuul also will have the same problem and would benefit from the solution design.

Luca.

Luca Milanesio

unread,
Mar 18, 2021, 6:52:11 AM3/18/21
to Repo and Gerrit Discussion, Luca Milanesio, Sven Selberg
But I believe you have a primary-replica setup, isn’ t it? Otherwise the CI would see the change immediately without having the micro-service checking for it.
Or are you saying that even with a single Gerrit service (no replicas) you may hit the same issue? If yes, that means that we are notifying the events *too soon* when they haven’t actually happened yet, which is a genuine bug to fix.

 


Having instead the payload *inside* the notification message, or at least the most important parts of It that are needed for the CI to do his job, it would allow an AP system to work properly.

Anyway, we do need a design document where we put use-cases in writing and discuss organically on them :-)

Wouldn't an online (3 hour or so) conference to vent thoughts and ideas for the event rewrite be a good thing?

It would take *more* than 3h to discuss all the edge cases though :-)
I would propose an initial design + reviews and *after that* a conference call on that, so that we have already some “pre-digested” ideas and discussions around it.


I have some ideas that are extremely simplified and that I haven't thought through:

* Write some sort of framework that can parse "something happened" events that can be used by event publishing plugins.

Yep, have you seen already the events-broker libModule at [1] ?

* Rip out the current "event-structure" (com.google.gerrit.server.events, com.google.gerrit.server.data, ?) from core into a lib (legacy-events)

The issue is that the current “event-structure” is a lot more pervasive in the entire code-base :-( but yes, it will take time to refactor it and having the legacy ones decoupled is a good idea.

* Remove stream-events ssh command from core Gerrit

Yeah ! It should be *one of the event brokering systems* and not the primary way to consume events.

* Write a plugin ("legacy-events") that uses the "something happened" framework and maps this to the the legacy-events lib to construct legacy events and implement a stream-events command equivalent.

+1

* The legacy-events lib could also be used by other event-publishing plugins that are interested in keeping backwards compatibility.




Luca.

>
> -Martin
>
> --
> The Qualcomm Innovation Center, Inc. is a member of Code
> Aurora Forum, hosted by The Linux Foundation
>
> --
> --
> To unsubscribe, email repo-discuss...@googlegroups.com
> More info at http://groups.google.com/group/repo-discuss?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/1667594.b6kNENyT7j%40mfick-lnx.


--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.

Sven Selberg

unread,
Mar 18, 2021, 7:15:44 AM3/18/21
to Repo and Gerrit Discussion
My intention was not to solve everything in 3 hours.
I thought it would be beneficial to get an alignment when it comes to some sort of general direction or possibly Road-map items.
This would make it much easier to compile/review/discuss the design-document(s).

Without an initial alignment I feel there's a real risk of design-document-rewrite-fatigue once review starts.
If we haven't aligned on what we want to do, how are we supposed to agree on how we should do it? :-)

Han-Wen Nienhuys

unread,
Mar 18, 2021, 7:33:49 AM3/18/21
to Sven Selberg, Repo and Gerrit Discussion
On Thu, Mar 18, 2021 at 10:29 AM Sven Selberg <sven.s...@axis.com> wrote:
>> Your last point is very important for us, because we are multi-site and the current architecture is an AP system, which is missing the ‘C’ part of the global consistency.
>> That means that when the CI system receives the event, it may NOT know from which instance it was generated and therefore would have no idea *which* Gerrit’s REST-API to call.
>>
>> If you just call *any Gerrit* you may actually get a 404, because the change hasn’t been propagated yet.
>>
>> This is actually happening *also* on *.gs.com and we can clearly see that from the Gerrit-CI builds: at times, even if the Checks API are reporting that a change needs verification, the associated REST-API to Gerrit are failing because the change hasn’t been replicated yet to that node.
>
>
> We have managed to trigger similar behavior with a single-primary setup.
> 1. CI system reacts on event from RabbitMQ
> 2. a micro-service (with limited data) tries to GET the change with a revision SHA1 of latest patchset.
> This round-trip was fast enough so that the index wasn't updated yet so Gerrit couldn't identify the change with the commit sha1 and returned 404.
>
> So the CI resiliency problem is not unique to the push-poll approach of "something happened" and/or multi-primary systems.

If you GET the change using the project~changenum as identifier, the
index is not involved.

Sven Selberg

unread,
Mar 18, 2021, 8:20:41 AM3/18/21
to Repo and Gerrit Discussion
On Thursday, March 18, 2021 at 12:33:49 PM UTC+1 han...@google.com wrote:
On Thu, Mar 18, 2021 at 10:29 AM Sven Selberg <sven.s...@axis.com> wrote:
>> Your last point is very important for us, because we are multi-site and the current architecture is an AP system, which is missing the ‘C’ part of the global consistency.
>> That means that when the CI system receives the event, it may NOT know from which instance it was generated and therefore would have no idea *which* Gerrit’s REST-API to call.
>>
>> If you just call *any Gerrit* you may actually get a 404, because the change hasn’t been propagated yet.
>>
>> This is actually happening *also* on *.gs.com and we can clearly see that from the Gerrit-CI builds: at times, even if the Checks API are reporting that a change needs verification, the associated REST-API to Gerrit are failing because the change hasn’t been replicated yet to that node.
>
>
> We have managed to trigger similar behavior with a single-primary setup.
> 1. CI system reacts on event from RabbitMQ
> 2. a micro-service (with limited data) tries to GET the change with a revision SHA1 of latest patchset.
> This round-trip was fast enough so that the index wasn't updated yet so Gerrit couldn't identify the change with the commit sha1 and returned 404.
>
> So the CI resiliency problem is not unique to the push-poll approach of "something happened" and/or multi-primary systems.

If you GET the change using the project~changenum as identifier, the
index is not involved.
 
Yep, but the way the micro-service was implemented it only had access to project, sha-1. 
This works most of the time so the flaw wasn't noticed until just recently.

FWIW  I don't consider this a Gerrit bug.
Reply all
Reply to author
Forward
0 new messages