gerrithub.io --> Github Repo out of sync?


Will Foster

Jul 16, 2019, 6:53:48 AM
to Repo and Gerrit Discussion
Today we started getting 500 errors when trying to submit changes through gerrithub.io code review --> github.

This has worked flawlessly for years, but it seems the upstream Github repository and Gerrithub have recently fallen out of sync.

Is this a known problem?  Is there some backend process on the Gerrithub side that corrects this after some time?

For example, our master branch has this as the latest change:

commit 301f244d0022b7aa4f481fefc21cd788e5424dc4 (HEAD -> master, origin/master, origin/HEAD)
Author: Gonzalo Rafuls <grafuls@XXXXXXX.com>
Date:   Mon Jul 15 14:15:36 2019 -0400

    Fix for verify_switchvonf

    We were missing some logic for verifying the switch conf for public
    vlans.

    Change-Id: Ieb18437ba213b9bac9638133116c4d07c63e8531



However, on the gerrithub.io side this change still shows as not merged.

We also cannot submit new changes, which makes me think the repos are out of sync.


Screenshot_2019-07-16_11-40-41.png
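
One way to check whether the two sides really disagree is to compare the branch heads that each remote advertises. The sketch below parses output in the format printed by `git ls-remote --heads <url>` (a standard git command) and reports branches whose SHA1 differs. The function names and the sample strings are illustrative assumptions; the two SHA1 values are the master heads quoted in this thread.

```python
from typing import Dict, Optional, Tuple


def parse_ls_remote(output: str) -> Dict[str, str]:
    """Parse `git ls-remote` output ("<sha1><TAB><ref>" per line) into {ref: sha1}."""
    refs: Dict[str, str] = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        sha1, ref = line.split(None, 1)
        refs[ref] = sha1
    return refs


def diverging_refs(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return refs whose SHA1 differs between two remotes (None means the ref is missing)."""
    return {
        ref: (a.get(ref), b.get(ref))
        for ref in sorted(set(a) | set(b))
        if a.get(ref) != b.get(ref)
    }


# Sample strings standing in for real `git ls-remote --heads <url>` output.
github_out = "301f244d0022b7aa4f481fefc21cd788e5424dc4\trefs/heads/master\n"
gerrithub_out = "10a8a6a8efabcdcd4737116d0665448434c8593e\trefs/heads/master\n"

diff = diverging_refs(parse_ls_remote(github_out), parse_ls_remote(gerrithub_out))
for ref, (gh, gr) in diff.items():
    print(f"{ref}: github={gh} gerrithub={gr}")
```

An empty result would mean both remotes advertise the same heads, which points the investigation away from replication and toward something else.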



Luca Milanesio

Jul 16, 2019, 7:30:35 AM
to Will Foster, Luca Milanesio, Repo and Gerrit Discussion
Hi Will,
over the past few months we have introduced multi-site replication on GerritHub.io (see [1]), which means that multiple nodes can accept reads and writes.

In your case, it looks like the node you're talking to is temporarily out of sync and, to prevent split-brain, it refuses further updates until it gets back in sync with the others.
(see what split-brain means at [2])

Let me look at your change and why it ended up in that situation.

Luca.





--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/a24a09c2-5234-4c29-991e-798c742272c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luca Milanesio

Jul 16, 2019, 7:44:23 AM
to Will Foster, Luca Milanesio, Repo and Gerrit Discussion

I've reconciled the state of the change, and it is now in sync with the rest of the cluster:

The change was actually submitted last night (00:30 BST) on a different node.
Let me know if you have any further issues.

P.S. The problem is linked to a series of bugs in the replication plugin; I've posted a few fixes that are under review at the moment (see [3]).

Luca.

Will Foster

Jul 16, 2019, 8:08:09 AM
to Repo and Gerrit Discussion




Hi Luca,

I am not sure what exactly fixed this, but this is what we did, and it seemed to work about 20 minutes ago.

1) In the Gerrithub UI repo settings, we created a new branch called "temporary" based on the last commit before things got "confused"
2) In the Gerrithub UI, we moved the patchset/review that was already committed to master, and could not be abandoned, to the new branch "temporary"
3) We were then able to abandon that patchset/review

At this point things started working again.

Also, around this time we had created another branch called "master2" and noticed that both its commit parent hash and the normal "master" branch updated to accurately reflect what was in Github.

Probably this was your work on the backend to sync things up?

Re: splitbrain and multi-node out of sync - is this something that will fix itself should it happen in the future?  If so how long does that normally take?  We were banging our heads on the desk trying to figure out what's wrong for at least an hour ha ha ha.

Thanks for your help, Luca

-will
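
For reference, the three UI steps above have counterparts in Gerrit's REST API (Create Branch, Move Change, Abandon Change). The sketch below only builds the requests and sends nothing. Authentication, the `/a/` prefix used for authenticated calls, and whether a given server permits moving changes are left out or assumed. The helper name `workaround_requests` is hypothetical; the change number and base SHA1 are the ones quoted in this thread.

```python
import json
from typing import List, Tuple
from urllib.parse import quote

BASE_URL = "https://review.gerrithub.io"


def workaround_requests(
    project: str, change: int, branch: str, base_sha1: str
) -> List[Tuple[str, str, str]]:
    """Build (method, url, json_body) tuples for the three steps; nothing is sent."""
    # Project and branch names are URL-encoded, so "/" becomes "%2F".
    proj = quote(project, safe="")
    br = quote(branch, safe="")
    return [
        # 1) Create the branch at the older commit.
        ("PUT", f"{BASE_URL}/projects/{proj}/branches/{br}",
         json.dumps({"revision": base_sha1})),
        # 2) Move the stuck change onto that branch.
        ("POST", f"{BASE_URL}/changes/{change}/move",
         json.dumps({"destination_branch": branch})),
        # 3) Abandon the change.
        ("POST", f"{BASE_URL}/changes/{change}/abandon", json.dumps({})),
    ]


requests_to_send = workaround_requests(
    "redhat-performance/quads",
    461909,
    "temporary",
    "10a8a6a8efabcdcd4737116d0665448434c8593e",
)
for method, url, body in requests_to_send:
    print(method, url, body)
```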


 

Luca Milanesio

Jul 16, 2019, 8:22:44 AM
to Will Foster, Luca Milanesio, Repo and Gerrit Discussion
Which one did you abandon? (URL?)


At this point things started working again.

Also, around this time we had created another branch called "master2" and noticed that both its commit parent hash and the normal "master" branch updated to accurately reflect what was in Github.

Probably this was your work on the backend to sync things up?

Yes, we have 3 sites at the moment and, because of bugs in the replication plugin, 4 replication events were missed; one of them was your change.
I've located the missed replication event and triggered it manually; after that, all the nodes agreed again on the SHA1 of that ref and your change was back on track.


Re: splitbrain and multi-node out of sync - is this something that will fix itself should it happen in the future?

Once the fixes under review are merged, it shouldn't happen anymore :-)
We have monitoring on the split-brain events and I could see your attempts this morning:


As you can see from the above, the review-1 node was in disagreement for some refs: I've just re-triggered those events and they went back on track.

  If so how long does that normally take?  We were banging our heads on the desk trying to figure out what's wrong for at least an hour ha ha ha.

The split-brain should *never* happen and that's why I'm keen on getting the fixes merged for the replication plugin.
However, we have protection against those situations and Gerrit will "lock" the refs temporarily.

Another issue is that, at the moment, the Gerrit UI has no "nice way" to tell you what's going on ... we're working on improving the user experience as well!


Thanks for your help, Luca

No problem and apologies for the inconvenience.

Luca.
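
The "lock" behaviour described above can be pictured with a small sketch: before accepting an update to a ref, a node compares its own SHA1 for that ref with a shared record of the latest agreed value, and refuses the update if they differ. This only illustrates the idea and is not the multi-site plugin's actual code; `RefRegistry`, `Node`, `OutOfSyncError`, the node name review-2, and the short SHA1 values are hypothetical.

```python
from typing import Dict, Optional


class OutOfSyncError(Exception):
    """Raised when a node's local ref does not match the shared record."""


class RefRegistry:
    """Hypothetical shared record of the latest agreed SHA1 for each ref."""

    def __init__(self) -> None:
        self._latest: Dict[str, str] = {}

    def latest(self, ref: str) -> Optional[str]:
        return self._latest.get(ref)

    def record(self, ref: str, sha1: str) -> None:
        self._latest[ref] = sha1


class Node:
    def __init__(self, name: str, registry: RefRegistry) -> None:
        self.name = name
        self.registry = registry
        self.refs: Dict[str, str] = {}

    def update_ref(self, ref: str, new_sha1: str) -> None:
        # Refuse the write if this node has fallen behind the shared record.
        expected = self.registry.latest(ref)
        if expected is not None and self.refs.get(ref) != expected:
            raise OutOfSyncError(f"{self.name}: {ref} is out of sync, update refused")
        self.refs[ref] = new_sha1
        self.registry.record(ref, new_sha1)

    def replicate_from(self, other: "Node", ref: str) -> None:
        # Stand-in for a replication event, including a manually re-triggered one.
        self.refs[ref] = other.refs[ref]


registry = RefRegistry()
review_1 = Node("review-1", registry)
review_2 = Node("review-2", registry)

review_2.update_ref("refs/heads/master", "301f244")  # accepted on another node
try:
    review_1.update_ref("refs/heads/master", "abc1234")  # replication was missed
except OutOfSyncError as err:
    print(err)

review_1.replicate_from(review_2, "refs/heads/master")  # missed event re-triggered
review_1.update_ref("refs/heads/master", "abc1234")  # accepted now
```

Refusing the write is what keeps two nodes from building divergent histories on the same ref, at the cost of errors on the lagging node until replication catches up.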

Will Foster

Jul 16, 2019, 8:45:35 AM
to Repo and Gerrit Discussion
This one here:  https://review.gerrithub.io/c/redhat-performance/quads/+/461909 (but later it showed up again as merged).  We were only able to abandon it by moving it into a new branch we created based on an older SHA1 commit that aligned with what Gerrit thought master was at (10a8a6a8efabcdcd4737116d0665448434c8593e vs 301f244d0022b7aa4f481fefc21cd788e5424dc4).

 


At this point things started working again.

Also, around this time we had created another branch called "master2" and noticed that both its commit parent hash and the normal "master" branch updated to accurately reflect what was in Github.

Probably this was your work on the backend to sync things up?

Yes, we have 3 sites at the moment and, because of bugs in the replication plugin, 4 replication events were missed; one of them was your change.
I've located the missed replication event and triggered it manually; after that, all the nodes agreed again on the SHA1 of that ref and your change was back on track.

Thanks for the detailed explanation here.  It makes a lot more sense now what happened.

 


Re: splitbrain and multi-node out of sync - is this something that will fix itself should it happen in the future?

Once the fixes under review are merged, it shouldn't happen anymore :-)
We have monitoring on the split-brain events and I could see your attempts this morning:


As you can see from the above, the review-1 node was in disagreement for some refs: I've just re-triggered those events and they went back on track.

  If so how long does that normally take?  We were banging our heads on the desk trying to figure out what's wrong for at least an hour ha ha ha.

The split-brain should *never* happen and that's why I'm keen on getting the fixes merged for the replication plugin.
However, we have protection against those situations and Gerrit will "lock" the refs temporarily.

Another issue is that, at the moment, the Gerrit UI has no "nice way" to tell you what's going on ... we're working on improving the user experience as well!

Awesome!

 


Thanks for your help, Luca

No problem and apologies for the inconvenience.

Luca.

Hey, there is little inconvenience for a free service, and we really appreciate your quick and thorough response and fixes here.  We'll also be looking to get paid support in the future; overall it's been a wonderful service.
 