Missing code review on instance-2


Jigar R

Jan 11, 2021, 4:46:45 PM
to Repo and Gerrit Discussion
I have set up two Gerrit instances on separate hosts on AWS. When both instance-1 and instance-2 are up and running, a code review created on either one is replicated to the other. However, I ran a small experiment:

I turned off instance-1, created a code review on instance-2, and waited until instance-2 had exhausted its maximum number of retries.

Settings on instance-2:
[remote "replication"]
    url = user@host:path_to_repository
    push = +refs/*:refs/*
    timeout = 600
    rescheduleDelay = 15
    mirror = true
    createMissingRepositories = true
    replicateProjectDeletions = true
    replicateHiddenProjects = true
[gerrit]
    autoReload = true
    replicationOnStartup = false
[replication]
    maxRetries = 5
    lockErrorMaxRetries = 5

When I started instance-1, it was missing the code review. Also both instances write to their local disk. What should I do so that instance-1 pulls code reviews after restart?

I tried changing the replicationOnStartup = false setting, but ended up with:
1) Invalid replication.config: gerrit.replicateOnStartup has to be set to 'false' for multi-site setups


Luca Milanesio

Jan 11, 2021, 5:09:03 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion


Are you running the two Gerrit instances in a multi-site configuration? (See [1]).

Luca.






--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/3a86eb29-7ac2-4b22-9e74-0c3feebb9bb0n%40googlegroups.com.

Jigar R

Jan 11, 2021, 5:42:27 PM
to Repo and Gerrit Discussion
In the Gerrit dashboard, the plugin GUI shows "multi-site" as enabled. I do see log files for other plugins, i.e. replication_log and sharedref_log, but I don't see a multi_site_log. I did not see any errors in error_log.

Under normal conditions, when both instance-1 and instance-2 are up and running, I can create code reviews on either one and see them on the other instance within a few minutes.

I have the following configuration in multi-site.config:
[index]
    maxTries = 50
    retryInterval = 30000
[broker]
    indexEventTopic = gerrit_index
    batchIndexEventTopic = gerrit_batch_index
    streamEventTopic = gerrit_stream
    projectListEventTopic = gerrit_list_project
    cacheEventTopic = gerrit_cache_eviction

Luca Milanesio

Jan 11, 2021, 5:46:14 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
Have you checked the message_log on both Gerrit sites?


I have the following configuration in multi-site.config:
[index]
    maxTries = 50
    retryInterval = 30000
[broker]
    indexEventTopic = gerrit_index
    batchIndexEventTopic = gerrit_batch_index
    streamEventTopic = gerrit_stream
    projectListEventTopic = gerrit_list_project
    cacheEventTopic = gerrit_cache_eviction

That looks strange: which version are you using?

Luca.





Jigar R

Jan 11, 2021, 6:00:59 PM
to Repo and Gerrit Discussion
I am using version 3.2, and I copied the configuration from the setup created by the setup_local_env script.

Luca Milanesio

Jan 11, 2021, 6:04:39 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
And what about the message_log?

Luca.

Jigar R

Jan 11, 2021, 6:15:31 PM
to Repo and Gerrit Discussion
I do see a message_log.

Luca Milanesio

Jan 11, 2021, 6:19:17 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
And do you see the change being propagated across the instances?

Jigar R

Jan 12, 2021, 1:31:25 PM
to Repo and Gerrit Discussion
When both sites are up, everything works like a charm. I will keep an eye on message_log now. I turned off instance_1 and created a code review on instance_2; replication_log showed that it tried to replicate to the other instance but failed. After it had exhausted the maximum retries, I started instance_1. The code review was not replicated to instance_1, as shown in the attached Issue_3 image.

Also, I somehow ended up in a situation where I have the right to verify code on one instance but not on the other. This is shown in the attached Issue_2 image.


 
Issue_3.png
Issue_2.png

Jigar R

Jan 12, 2021, 3:43:03 PM
to Repo and Gerrit Discussion
How can I reschedule operations once the push to the other host has been cancelled after the maximum number of retries?

Jigar R

Jan 12, 2021, 4:35:08 PM
to Repo and Gerrit Discussion
I ran "ssh -p 29418 admin@instance_2 replication start --now --wait --all" so that instance_1 would catch up with the changes. I verified that the repositories on instance_1 have caught up, but the dashboard is not in sync. I also flushed the caches and restarted Gerrit.

Luca Milanesio

Jan 12, 2021, 4:37:49 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
Have you checked the message_log? Do you see the reindexing event propagated on the second instance?

Luca.

Jigar R

Jan 14, 2021, 3:55:49 PM
to Repo and Gerrit Discussion
When I stopped instance_1 and performed operations on instance_2, I saw publish events in instance_2's message_log. Afterwards I started instance_1, and there was radio silence in instance_1's message_log. I noticed the following in error_log. It seems that instance_1 is trying to resolve change test~4.

[2021-01-14T20:48:40.019+0000] [main] INFO  com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.2.6 ready
[2021-01-14T20:48:55.149+0000] [plugin-manager-preloader] INFO  com.googlesource.gerrit.plugins.manager.OnStartStop : 73 plugins successfully pre-loaded
[2021-01-14T20:49:08.269+0000] [Forwarded-Index-Event-1] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T13:58:13, project=test, changeId=41, targetSha=5707de52db10735ccd692951c7f13b18c8d69247, deleted=false}]) after 1 attempt(s)
[2021-01-14T20:49:08.269+0000] [Forwarded-Index-Event-1] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #2 time to index Change test~41 after 30000 msecs

[2021-01-14T20:49:08.278+0000] [Forwarded-Index-Event-2] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T17:57:15, project=test, changeId=41, targetSha=5707de52db10735ccd692951c7f13b18c8d69247, deleted=false}]) after 1 attempt(s)
[2021-01-14T20:49:08.278+0000] [Forwarded-Index-Event-2] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #2 time to index Change test~41 after 30000 msecs
[2021-01-14T20:49:08.283+0000] [Forwarded-Index-Event-3] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T17:57:20, project=test, changeId=41, targetSha=a8a4088ffdd19a7dff504cdab93918aeac766c5e, deleted=false}]) after 1 attempt(s)
[2021-01-14T20:49:08.283+0000] [Forwarded-Index-Event-3] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #2 time to index Change test~41 after 30000 msecs
[2021-01-14T20:49:38.271+0000] [Forwarded-Index-Event-3] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T13:58:13, project=test, changeId=41, targetSha=5707de52db10735ccd692951c7f13b18c8d69247, deleted=false}]) after 2 attempt(s)


[2021-01-14T20:49:38.271+0000] [Forwarded-Index-Event-3] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #3 time to index Change test~41 after 30000 msecs
[2021-01-14T20:49:38.279+0000] [Forwarded-Index-Event-4] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T17:57:15, project=test, changeId=41, targetSha=5707de52db10735ccd692951c7f13b18c8d69247, deleted=false}]) after 2 attempt(s)
[2021-01-14T20:49:38.279+0000] [Forwarded-Index-Event-4] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #3 time to index Change test~41 after 30000 msecs
[2021-01-14T20:49:38.284+0000] [Forwarded-Index-Event-2] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Change test~41 not present yet in local Git repository (event=Optional[ChangeIndexEvent{eventCreatedOn=2021-01-14T17:57:20, project=test, changeId=41, targetSha=a8a4088ffdd19a7dff504cdab93918aeac766c5e, deleted=false}]) after 2 attempt(s)
[2021-01-14T20:49:38.284+0000] [Forwarded-Index-Event-2] WARN  com.googlesource.gerrit.plugins.multisite.forwarder.ForwardedIndexChangeHandler : Retrying for the #3 time to index Change test~41 after 30000 msecs
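
The warnings above repeat for the same change, so it can help to condense them into one line per change. The following is a minimal sketch, not part of the multi-site plugin: the function name and the regular expression are illustrative, and the sample lines are shortened copies of the error_log excerpt above.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Matches the "not present yet" warning from ForwardedIndexChangeHandler as it
# appears in the excerpt above; the pattern is an illustration, not a log spec.
PENDING = re.compile(
    r"Change (?P<change>\S+~\d+) not present yet in local Git repository"
    r".*after (?P<attempts>\d+) attempt\(s\)"
)


def max_attempts_per_change(lines: Iterable[str]) -> Dict[str, int]:
    """Map each change (project~number) to the highest attempt count seen."""
    seen: Dict[str, int] = defaultdict(int)
    for line in lines:
        match = PENDING.search(line)
        if match:
            change = match.group("change")
            seen[change] = max(seen[change], int(match.group("attempts")))
    return dict(seen)


# Shortened copies of lines from the error_log excerpt (event details elided).
sample = [
    "[2021-01-14T20:49:08.269+0000] WARN : Change test~41 not present yet in local Git repository (event=...) after 1 attempt(s)",
    "[2021-01-14T20:49:08.269+0000] WARN : Retrying for the #2 time to index Change test~41 after 30000 msecs",
    "[2021-01-14T20:49:38.271+0000] WARN : Change test~41 not present yet in local Git repository (event=...) after 2 attempt(s)",
]

print(max_attempts_per_change(sample))
```

For the excerpt in this message, every warning refers to the same change, test~41, which is still missing from the local Git repository after two attempts.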

Luca Milanesio

Jan 14, 2021, 4:36:57 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
Is the change #41 the one you’ve created on instance-1 while instance-2 was down?
If yes, then you need to reschedule replication from instance-1, so that instance-2 can get it.

The retry counts and intervals need to be tuned to allow your maximum planned downtime.

HTH

Luca.
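
The point about tuning retry counts and intervals can be made concrete with a little arithmetic. The sketch below is only an illustration: it multiplies the number of attempts by the wait between them to estimate how long an outage a given setting can absorb. The values 50 and 30000 are the index.maxTries and index.retryInterval from the multi-site.config posted earlier in this thread; the function name and the assumption that attempts are evenly spaced are hypothetical and not taken from the plugin source.

```python
from datetime import timedelta


def retry_window(max_tries: int, retry_interval_ms: int) -> timedelta:
    """Rough upper bound on the outage a fixed-interval retry loop can absorb.

    Assumption: every attempt is followed by the same wait, so the window is
    simply attempts * interval.
    """
    if max_tries < 0 or retry_interval_ms < 0:
        raise ValueError("max_tries and retry_interval_ms must be non-negative")
    return timedelta(milliseconds=max_tries * retry_interval_ms)


# Values from the multi-site.config posted in this thread.
index_window = retry_window(max_tries=50, retry_interval_ms=30000)
print(f"index retries cover roughly {index_window}")  # 0:25:00
```

Under that assumption the forwarded index events would be retried for about 25 minutes. The same reasoning applies to replication.maxRetries = 5 on the replication side, although the wait between replication retries is not shown in the posted configuration.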

Jigar R

Jan 15, 2021, 8:29:51 AM
to Repo and Gerrit Discussion
I am testing an unexpected DR scenario in which one instance is unavailable, either because of a DC outage or because the host was shut down unexpectedly. In that case, it is expected that the other instance reaches the maximum retries and cancels the operation. To bring both Gerrit instances back in sync, I had to:
- run replication manually using the CLI (would installing the healthcheck plugin help here?)
- run "ssh -p 29418 user@instance gerrit index start changes --force" to sync the dashboard
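
The two manual steps above could be wrapped in a small helper so that they are always run in the same order. This is a minimal sketch under stated assumptions: the function name, the user name and the host names are placeholders, and the choice of which host to reindex is an assumption, since the thread only says "user@instance". The ssh arguments themselves are the ones quoted in this thread. The example only prints the commands and does not execute them.

```python
import shlex
from typing import List

SSH_PORT = "29418"  # Gerrit SSH port used in the commands quoted in this thread


def recovery_commands(user: str, source_host: str, reindex_host: str) -> List[List[str]]:
    """Return the two ssh invocations, in the order they were run in the thread.

    source_host:  instance that holds the changes and pushes them out.
    reindex_host: instance whose dashboard is stale (assumed, see lead-in).
    """
    ssh = ["ssh", "-p", SSH_PORT]
    replicate = ssh + [
        f"{user}@{source_host}", "replication", "start", "--now", "--wait", "--all",
    ]
    reindex = ssh + [
        f"{user}@{reindex_host}", "gerrit", "index", "start", "changes", "--force",
    ]
    return [replicate, reindex]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in recovery_commands("admin", "instance_2", "instance_1"):
        print(shlex.join(argv))
```

Replication is listed first because the reindex can only pick up changes that are already present in the local Git repositories.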

Luca Milanesio

Jan 15, 2021, 10:01:21 AM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
Thanks for clarifying the scenario.
The situation would *automatically* recover as the system is designed to be self-healing.

In such case, it's expected that instance would reach max retries and cancels operation.

It depends on your configuration: if you need to put together a business continuity plan, you will also need to make assumptions.
What are your assumptions?

In order to make both gerrit instance in-sync, I had to
- run replication manually using CLI ( does installing healthcheck plugin help here? )
- to sync dashboard, I ran "ssh -p 29418 user@instance gerrit index start changes --force"

That is because your configuration doesn’t cover your scenario.
If you can help define the scenario and the assumptions, it would then be easier to help you out :-)

Luca.

Jigar R

Jan 15, 2021, 5:07:47 PM
to Repo and Gerrit Discussion
I believe the self-healing property applies only when a Gerrit instance comes back up before the other Gerrit instance gives up on pushing Gerrit events.
In such case, it's expected that instance would reach max retries and cancels operation.

It depends on your configuration: if you need to put together a business continuity plan, you will also need to make assumptions.
What are your assumptions?

We would have Gerrit in two data centers. All traffic would go to only one data center (DC-1, the primary), and the second one would just mirror the primary Gerrit. In case of a DC outage, all traffic would be redirected to the mirrored Gerrit, which would then be promoted to primary.
When the DC is back, the ex-primary would sync up: we would trigger replication manually from the new primary Gerrit and run a reindex, which would sync up the repositories and the dashboard. Afterwards, we would promote the ex-primary back to primary.

The assumption here is that the DC outage would last for a while. During the outage, the available Gerrit would reach its maximum retries and cancel operations.

Luca Milanesio

Jan 15, 2021, 5:40:44 PM
to Jigar R, Luca Milanesio, Repo and Gerrit Discussion
Yes, and that depends on your configuration.
If your multi-site system is configured to self-heal from failures lasting for a max of 24h, then it will auto-recover for 24h outages.

Can you translate your “for a while” with the assumptions you have in your business continuity plan?

You cannot recover automatically from any type of failure that lasts for any interval of time: it’s just impossible.
For every business continuity plan you have to assume a scenario and plan for it.

Luca.
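
Turning an outage assumption into a setting is the reverse calculation: given the longest outage the plan must cover and the wait between attempts, how many attempts are needed? The sketch below is illustrative only. The 24-hour figure is the example used in this message and 30000 ms is the index.retryInterval from the posted multi-site.config; the function name and the evenly spaced retry model are assumptions, and the replication retries would need to cover the same window.

```python
import math


def tries_needed(outage_hours: float, retry_interval_ms: int) -> int:
    """Smallest number of evenly spaced attempts that spans the outage."""
    if retry_interval_ms <= 0:
        raise ValueError("retry_interval_ms must be positive")
    outage_ms = outage_hours * 3600 * 1000
    return math.ceil(outage_ms / retry_interval_ms)


# 24 h outage with a 30 s wait between attempts.
print(tries_needed(outage_hours=24, retry_interval_ms=30000))  # 2880
```

Compared with the 50 attempts in the posted configuration, covering a 24-hour outage at the same interval would take 2880 attempts, which shows why the expected outage length has to be stated before the values can be chosen.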
