High Availability setup, cloning issues on replica.


elzoc...@gmail.com

May 24, 2021, 10:14:45 AM
to Repo and Gerrit Discussion
Hi all,
Status:  Gerrit 3.3.4, all data on a shared AWS EFS.

1x Master 
1x Replica

On the EFS, master and replica each have their own home folder, mounted at /gerrit on their respective VMs. There is also a shared folder common to both, which holds the websessions and is mounted at /gerrit/shared. Finally, the git folder, also common to both VMs, is mounted at /gerrit/shared/git: read-only on the replica and rw on master.

Replica is kept up-to-date via the HighAvailability plugin.


Only the HA plugin on gerritmaster is configured with credentials to talk to its peer.

gerritmaster:~$ cat etc/high-availability.config
[main]
  sharedDirectory = /gerrit/shared
[autoReindex]
  enabled = false
[peerInfo]
  strategy = static
[peerInfo "static"]
[http]
  user = hauser
  password = password

gerritreplica:~$ cat etc/high-availability.config
[main]
  sharedDirectory = /gerrit/shared
[autoReindex]
  enabled = false
[peerInfo]
  strategy = static


A month or so ago, on one of the threads I saw a comment saying that since Gerrit 3.3 with HA plugin, both nodes could be serving clones, so last week I decided to test it for our local traffic.

On the ELB, I started redirecting traffic: if host is my IP and git-upload-pack in path or query then send to replica.
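
For illustration, the ELB rule described above could be written as a predicate like the following. This is only a sketch of the matching logic, not actual ELB configuration: the function name `route_to_secondary`, its parameters, and the example host name are made up, and it assumes "host is my IP" means the source address of the request.

```python
from urllib.parse import urlsplit

# Marker that identifies fetch/clone traffic in Git's smart HTTP protocol.
MARKER = "git-upload-pack"


def route_to_secondary(source_ip, url, my_ip):
    """Return True when a request should be sent to the secondary node.

    Hypothetical helper: mirrors the rule "source is my IP and
    git-upload-pack appears in the path or the query string".
    """
    parts = urlsplit(url)
    is_upload_pack = MARKER in parts.path or MARKER in parts.query
    return source_ip == my_ip and is_upload_pack


if __name__ == "__main__":
    base = "https://gerrit.example.com/r/component/common"
    print(route_to_secondary("10.1.0.78", base + "/git-upload-pack", "10.1.0.78"))
    print(route_to_secondary("10.1.0.78", base + "/git-receive-pack", "10.1.0.78"))
```

With such a rule, both the ref advertisement (`GET .../info/refs?service=git-upload-pack`) and the pack request (`POST .../git-upload-pack`) from the chosen address go to the secondary node, while pushes (`git-receive-pack`) and UI traffic stay on master.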

I had to add a retry to some of our gerrit-triggered jenkins jobs because I saw some errors like the following:

git fetch origin refs/changes/72/56872/6
fatal: couldn't find remote ref refs/changes/72/56872/6
an error '128' has occurred in the scanning script

I think the changes had not yet been re-indexed on the replica; with the retry in place there was no further problem.
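
The retry added to the Jenkins jobs could look roughly like this. It is a minimal sketch, not the actual job script: `fetch_with_retry`, the attempt count and the delay are assumptions, and the `run` and `sleep` parameters exist only so the function can be exercised without a real repository.

```python
import subprocess
import time


def fetch_with_retry(ref, remote="origin", attempts=3, delay_seconds=5.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `git fetch <remote> <ref>`, retrying while git exits non-zero.

    Returns the number of the attempt that succeeded and raises
    RuntimeError once all attempts have failed.
    """
    for attempt in range(1, attempts + 1):
        result = run(["git", "fetch", remote, ref])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give the other node a moment to see the new ref.
            sleep(delay_seconds)
    raise RuntimeError(
        f"git fetch {remote} {ref} failed after {attempts} attempts")
```

A job would call, for example, `fetch_with_retry("refs/changes/72/56872/6")` in place of the bare `git fetch`. A retry like this only hides the delay before a new ref becomes visible on the second node; it does not address the "early EOF" failures described further down.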


Over the weekend, though, some jobs started failing with various errors:


repo sync -d -c -q --no-tags
component/test:
fatal: early EOF
fatal: unpack-objects failed

component/test:
fatal: early EOF
fatal: unpack-objects failed


error: Cannot checkout component/test: ManifestInvalidRevisionError: revision branch1 in component/test not found
error: in `sync -d -c -q --no-tags`: revision branch1 in component/test not found



This morning, one of the components started failing 100% of the time with:
Cloning into 'common'...
remote: Counting objects: 3, done
fatal: The remote end hung up unexpectedly
fatal: early EOFs:  75% (790/1053)
fatal: index-pack failed


So I had to remove the ELB redirection to get everything back to gerritmaster.


For that last error, every time I invoked the clone I got the following trace in the replica server's error_log:


[2021-05-24T13:10:49.968+01:00] [HTTP POST /r/component/common/git-upload-pack (Cedric from 10.1.0.78)] WARN  org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /gerrit/shared/git/component/common.git
java.io.IOException: Stale file handle
at java.base/java.io.RandomAccessFile.readBytes(Native Method)
at java.base/java.io.RandomAccessFile.read(RandomAccessFile.java:406)
at java.base/java.io.RandomAccessFile.readFully(RandomAccessFile.java:470)
at org.eclipse.jgit.internal.storage.file.PackFile.read(PackFile.java:725)
at org.eclipse.jgit.internal.storage.file.WindowCache.load(WindowCache.java:516)
at org.eclipse.jgit.internal.storage.file.WindowCache.getOrLoad(WindowCache.java:603)
at org.eclipse.jgit.internal.storage.file.WindowCache.get(WindowCache.java:386)
at org.eclipse.jgit.internal.storage.file.WindowCursor.pin(WindowCursor.java:327)
at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:247)
at org.eclipse.jgit.internal.storage.file.PackFile.copyPackAsIs(PackFile.java:391)
at org.eclipse.jgit.internal.storage.file.LocalCachedPack.copyAsIs(LocalCachedPack.java:53)
at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:239)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1242)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2312)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2143)
at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:1077)
at org.eclipse.jgit.transport.UploadPack.uploadWithExceptionPropagation(UploadPack.java:834)
at org.eclipse.jgit.http.server.UploadPackServlet.lambda$doPost$0(UploadPackServlet.java:184)
at org.eclipse.jgit.http.server.UploadPackServlet.defaultUploadPackHandler(UploadPackServlet.java:207)
at org.eclipse.jgit.http.server.UploadPackServlet.doPost(UploadPackServlet.java:201)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:661)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:211)
at com.google.gerrit.httpd.GitOverHttpServlet$UploadFilter.doFilter(GitOverHttpServlet.java:435)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.UploadPackServlet$Factory.doFilter(UploadPackServlet.java:137)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.RepositoryFilter.doFilter(RepositoryFilter.java:112)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.NoCacheFilter.doFilter(NoCacheFilter.java:53)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.glue.UrlPipeline.service(UrlPipeline.java:188)
at org.eclipse.jgit.http.server.glue.SuffixPipeline.service(SuffixPipeline.java:70)
at org.eclipse.jgit.http.server.glue.MetaFilter.doFilter(MetaFilter.java:150)
at org.eclipse.jgit.http.server.glue.MetaServlet.service(MetaServlet.java:109)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:290)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:280)
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:184)
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:89)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at com.google.gerrit.httpd.raw.StaticModule$PolyGerritFilter.doFilter(StaticModule.java:390)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.GetUserFilter.doFilter(GetUserFilter.java:92)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequireSslFilter.doFilter(RequireSslFilter.java:72)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RunAsFilter.doFilter(RunAsFilter.java:120)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.SetThreadNameFilter.doFilter(SetThreadNameFilter.java:62)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:139)
at com.googlesource.gerrit.plugins.readonly.ReadOnly.doFilter(ReadOnly.java:79)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
at com.googlesource.gerrit.plugins.javamelody.GerritMonitoringFilter.doFilter(GerritMonitoringFilter.java:66)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at com.google.gerrit.httpd.AllowRenderInFrameFilter.doFilter(AllowRenderInFrameFilter.java:56)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy.doFilter(AllRequestFilter.java:141)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestCleanupFilter.doFilter(RequestCleanupFilter.java:60)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.pgm.http.jetty.ProjectQoSFilter.doFilter(ProjectQoSFilter.java:182)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.ProjectBasicAuthFilter.doFilter(ProjectBasicAuthFilter.java:105)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestMetricsFilter.doFilter(RequestMetricsFilter.java:57)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestContextFilter.doFilter(RequestContextFilter.java:64)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:54)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handleAsync(Server.java:559)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$2(HttpChannel.java:396)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:396)
at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:340)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
at java.base/java.lang.Thread.run(Thread.java:829)



Note: A similar error appears in the error_log, but I don't think it caused an error on the git client side (I can't be sure because there were lots of them at that time, for only a few failures) 

[2021-05-23T03:26:30.148+01:00] [HTTP POST /r/docs/git-upload-pack (jenkins from xx.xx.xx.xxx)] WARN  org.eclipse.jgit.internal.storage.file.ObjectDirectory : Pack file /gerrit/shared/git/docs.git/objects/pack/pack-226d13165970d2f9c33aaeeb8d14cef05494ab62.pack handle is stale, removing it from pack list
java.io.IOException: Stale file handle
at java.base/java.io.RandomAccessFile.readBytes(Native Method)
at java.base/java.io.RandomAccessFile.read(RandomAccessFile.java:406)
at java.base/java.io.RandomAccessFile.readFully(RandomAccessFile.java:470)
at org.eclipse.jgit.internal.storage.file.PackFile.read(PackFile.java:725)
at org.eclipse.jgit.internal.storage.file.WindowCache.load(WindowCache.java:516)
at org.eclipse.jgit.internal.storage.file.WindowCache.getOrLoad(WindowCache.java:603)
at org.eclipse.jgit.internal.storage.file.WindowCache.get(WindowCache.java:386)
at org.eclipse.jgit.internal.storage.file.WindowCursor.pin(WindowCursor.java:327)
at org.eclipse.jgit.internal.storage.file.WindowCursor.copy(WindowCursor.java:226)
at org.eclipse.jgit.internal.storage.file.PackFile.readFully(PackFile.java:614)
at org.eclipse.jgit.internal.storage.file.PackFile.representation(PackFile.java:1085)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.selectObjectRepresentation(ObjectDirectory.java:593)
at org.eclipse.jgit.internal.storage.file.ObjectDirectory.selectObjectRepresentation(ObjectDirectory.java:584)
at org.eclipse.jgit.internal.storage.file.WindowCursor.selectObjectRepresentation(WindowCursor.java:177)
at org.eclipse.jgit.internal.storage.pack.PackWriter.searchForReuse(PackWriter.java:1343)
at org.eclipse.jgit.internal.storage.pack.PackWriter.searchForReuse(PackWriter.java:1317)
at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1178)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2312)
at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:2143)
at org.eclipse.jgit.transport.UploadPack.fetchV2(UploadPack.java:1241)
at org.eclipse.jgit.transport.UploadPack.serveOneCommandV2(UploadPack.java:1278)
at org.eclipse.jgit.transport.UploadPack.serviceV2(UploadPack.java:1325)
at org.eclipse.jgit.transport.UploadPack.uploadWithExceptionPropagation(UploadPack.java:832)
at org.eclipse.jgit.http.server.UploadPackServlet.lambda$doPost$0(UploadPackServlet.java:184)
at org.eclipse.jgit.http.server.UploadPackServlet.defaultUploadPackHandler(UploadPackServlet.java:207)
at org.eclipse.jgit.http.server.UploadPackServlet.doPost(UploadPackServlet.java:201)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:661)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:211)
at com.google.gerrit.httpd.GitOverHttpServlet$UploadFilter.doFilter(GitOverHttpServlet.java:435)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.UploadPackServlet$Factory.doFilter(UploadPackServlet.java:137)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.RepositoryFilter.doFilter(RepositoryFilter.java:112)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.NoCacheFilter.doFilter(NoCacheFilter.java:53)
at org.eclipse.jgit.http.server.glue.UrlPipeline$Chain.doFilter(UrlPipeline.java:209)
at org.eclipse.jgit.http.server.glue.UrlPipeline.service(UrlPipeline.java:188)
at org.eclipse.jgit.http.server.glue.SuffixPipeline.service(SuffixPipeline.java:70)
at org.eclipse.jgit.http.server.glue.MetaFilter.doFilter(MetaFilter.java:150)
at org.eclipse.jgit.http.server.glue.MetaServlet.service(MetaServlet.java:109)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:742)
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:290)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:280)
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:184)
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:89)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at com.google.gerrit.httpd.raw.StaticModule$PolyGerritFilter.doFilter(StaticModule.java:387)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.GetUserFilter.doFilter(GetUserFilter.java:92)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequireSslFilter.doFilter(RequireSslFilter.java:72)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RunAsFilter.doFilter(RunAsFilter.java:120)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.SetThreadNameFilter.doFilter(SetThreadNameFilter.java:62)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:139)
at com.googlesource.gerrit.plugins.readonly.ReadOnly.doFilter(ReadOnly.java:79)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
at net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
at com.googlesource.gerrit.plugins.javamelody.GerritMonitoringFilter.doFilter(GerritMonitoringFilter.java:66)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at com.google.gerrit.httpd.AllowRenderInFrameFilter.doFilter(AllowRenderInFrameFilter.java:56)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy$1.doFilter(AllRequestFilter.java:135)
at com.google.gerrit.httpd.AllRequestFilter$FilterProxy.doFilter(AllRequestFilter.java:141)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestCleanupFilter.doFilter(RequestCleanupFilter.java:60)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.pgm.http.jetty.ProjectQoSFilter.doFilter(ProjectQoSFilter.java:182)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.ProjectBasicAuthFilter.doFilter(ProjectBasicAuthFilter.java:103)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestMetricsFilter.doFilter(RequestMetricsFilter.java:57)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.gerrit.httpd.RequestContextFilter.doFilter(RequestContextFilter.java:64)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:121)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:133)
at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1435)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1350)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.RequestLogHandler.handle(RequestLogHandler.java:54)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handleAsync(Server.java:559)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$2(HttpChannel.java:396)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:396)
at org.eclipse.jetty.server.HttpChannel.run(HttpChannel.java:340)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:773)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:905)
at java.base/java.lang.Thread.run(Thread.java:829)


Note2: an hour has passed since I removed the ELB redirection, and now, if I force the same clone from the replica, it works...


Any idea ?

Thanks,
Cedric.

Luca Milanesio

May 24, 2021, 3:25:46 PM
to elzoc...@gmail.com, Luca Milanesio, Repo and Gerrit Discussion

On 24 May 2021, at 15:14, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

> Hi all,
> Status:  Gerrit 3.3.4, all data on a shared AWS EFS.

> 1x Master
> 1x Replica

The above doesn’t really look like an HA setup, as the two nodes are not identical.


> On the EFS, master and replica each have their own home folder, mounted at /gerrit on their respective VMs. There is also a shared folder common to both, which holds the websessions and is mounted at /gerrit/shared. Finally, the git folder, also common to both VMs, is mounted at /gerrit/shared/git: read-only on the replica and rw on master.

> Replica is kept up-to-date via the HighAvailability plugin.

> Only the HA plugin on gerritmaster is configured with credentials to talk to its peer.

> gerritmaster:~$ cat etc/high-availability.config
> [main]
>   sharedDirectory = /gerrit/shared
> [autoReindex]
>   enabled = false
> [peerInfo]
>   strategy = static
> [peerInfo "static"]
> [http]
>   user = hauser
>   password = password

> gerritreplica:~$ cat etc/high-availability.config
> [main]
>   sharedDirectory = /gerrit/shared
> [autoReindex]
>   enabled = false
> [peerInfo]
>   strategy = static

Very strange indeed: what would you expect the HA plugin to do on the Gerrit replica?

> A month or so ago, on one of the threads I saw a comment saying that since Gerrit 3.3 with HA plugin, both nodes could be serving clones, so last week I decided to test it for our local traffic.

> On the ELB, I started redirecting traffic: if host is my IP and git-upload-pack in path or query then send to replica.

> I had to add a retry to some of our gerrit-triggered jenkins jobs because I saw some errors like the following:

> git fetch origin refs/changes/72/56872/6
> fatal: couldn't find remote ref refs/changes/72/56872/6
> an error '128' has occurred in the scanning script

> I think the changes had not yet been re-indexed on the replica; with the retry in place there was no further problem.

Gerrit replicas do not have indexes, do not have sessions, do not have a GUI and do not serve REST-API, hence the HA plugin would have basically nothing to do there.

Has the system ever worked?
(I mean in HA mode: the first master goes down and the second replica is able to manage the incoming traffic)

The “stale file handle” errors are 100% expected and are caused by the NFS sharing.


> Note2: an hour has passed since I removed the ELB redirection, and now, if I force the same clone from the replica, it works...

> Any idea ?

If you could spend some time explaining a bit further how the systems ever worked, then I would be able to share my ideas about it :-)

Luca.



--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/f917b75c-d596-4b2c-8278-fd6c0e191629n%40googlegroups.com.

elzoc...@gmail.com

May 24, 2021, 4:18:13 PM
to Repo and Gerrit Discussion
Hi Luca, 

Thanks for your reply.
Sorry for the confusion: I forgot there is a container.replica setting, and we are not using it. By replica I meant a copy of master (with a read-only git folder); I will call it Gerrit2 in my answers below.

I think our secondary instance was named gerritreplica because its AWS RDS instance was a read-only replica of the one used by Master, but that was way before my time on the project.

On Monday, May 24, 2021 at 20:25:46 UTC+1, lucamilanesio wrote:

> On 24 May 2021, at 15:14, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

> > Hi all,
> > Status:  Gerrit 3.3.4, all data on a shared AWS EFS.

> > 1x Master
> > 1x Replica

> The above doesn’t really look like an HA setup, as the two nodes are not identical.

So, 1x Master,  1x Gerrit2.

Gerrit2 was copied from Master ages ago and is kept up to date via the HA plugin; it's part of a blue/green deployment.

 


> Very strange indeed: what would you expect the HA plugin to do on the Gerrit replica?

All traffic gets redirected to "Gerrit2" when a problem is detected on the master. The UI and changes are still present and updated, but no write can work due to the read-only filesystem.


> Gerrit replicas do not have indexes, do not have sessions, do not have a GUI and do not serve REST-API, hence the HA plugin would have basically nothing to do there.

> Has the system ever worked?
> (I mean in HA mode: the first master goes down and the second replica is able to manage the incoming traffic)



> > Over the weekend, though, some jobs started failing with various errors:

> > repo sync -d -c -q --no-tags
> > component/test:
> > fatal: early EOF
> > fatal: unpack-objects failed

> > component/test:
> > fatal: early EOF
> > fatal: unpack-objects failed

> > error: Cannot checkout component/test: ManifestInvalidRevisionError: revision branch1 in component/test not found
> > error: in `sync -d -c -q --no-tags`: revision branch1 in component/test not found



The errors above happened a few times while cloning against Gerrit2. The manifest points to

<project name="component/test" revision="branch1"/>

and branch1 has been present in that repo for years.

OK, thanks. The first trace above (the one without "removing it from pack list") is the direct result of a clone of the head (master) that failed with "index-pack failed". How should that be handled?
Two Jenkins jobs failed to clone the repo 20 minutes apart. I then manually tried to re-clone the repo, and every time it failed with that same trace. An hour later it was working fine.


> If you could spend some time explaining a bit further how the systems ever worked, then I would be able to share my ideas about it :-)


Thanks again!

Luca Milanesio

May 25, 2021, 5:31:28 AM
to Repo and Gerrit Discussion, Luca Milanesio

On 24 May 2021, at 21:18, elzoc...@gmail.com <elzoc...@gmail.com> wrote:

> Hi Luca,

> Thanks for your reply.
> Sorry for the confusion: I forgot there is a container.replica setting, and we are not using it. By replica I meant a copy of master (with a read-only git folder); I will call it Gerrit2 in my answers below.

Gotcha.


> I think our secondary instance was named gerritreplica because its AWS RDS instance was a read-only replica of the one used by Master, but that was way before my time on the project.

I see. We are putting together a glossary in Gerrit so that we can avoid misunderstandings in the future :-)
In Gerrit we call “read-only replica” what was previously called “slave”, a term that we don’t use anymore because the community considers it disrespectful.

In your case, it is an identical copy of Gerrit with a read-only filesystem.


> So, 1x Master,  1x Gerrit2.

Based on what you mentioned, it is 2x Gerrit primary nodes (we don’t use the ‘master’ term either).


> Gerrit2 was copied from Master ages ago and is kept up to date via the HA plugin; it's part of a blue/green deployment.

+1


 


> All traffic gets redirected to "Gerrit2" when a problem is detected on the master. The UI and changes are still present and updated, but no write can work due to the read-only filesystem.

How do you switch the read-only filesystem to read-write automatically?
I actually saw this issue yesterday in one of our systems.
Which Java VM version are you using?



> Thanks again!

No problem :-)

Luca.

 
elzoc...@gmail.com

May 25, 2021, 6:15:44 AM5/25/21
to Repo and Gerrit Discussion
We do not.
That secondary instance has always been read-only, and its goal was mostly to maintain the cloning activity in case of a problem on master.
Before NoteDB / PolyGerrit, we were not using the HA plugin at all; we just had a footer masking the badly out-of-date dashboards.
When I added the HA plugin I thought of making that instance rw and giving it credentials to communicate with main, but never took the step.
openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)


I found [1] after your comment on NFS sharing yesterday, and added trustFolderStat=false to my etc/jgit.config this morning before redirecting my local traffic back to the read-only secondary instance. Will see if that improves things.
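
For reference, a minimal sketch of how that setting could be applied from the shell. It assumes the Gerrit site directory is passed as the first argument and that the `git` CLI is available; the helper name `set_trust_folder_stat_false` is hypothetical. `core.trustFolderStat` is the JGit option mentioned above: with `false`, JGit rescans `objects/pack` instead of trusting the cached file attributes of the directory, which can help on NFS-like filesystems at some performance cost.

```shell
#!/bin/sh
# Sketch only: write core.trustFolderStat=false into <site>/etc/jgit.config.
# The helper name and the "site directory as first argument" convention
# are assumptions of this example, not something from the thread.
set_trust_folder_stat_false() {
  site="$1"
  cfg="$site/etc/jgit.config"
  mkdir -p "$site/etc"
  # git config -f edits an arbitrary git-style config file in place
  git config -f "$cfg" core.trustFolderStat false
  # Print the stored value so the caller can verify it
  git config -f "$cfg" --get core.trustFolderStat
}
```

Usage would be something like `set_trust_folder_stat_false /gerrit`, assuming the site lives in /gerrit as in the setup described at the top of the thread.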

 







Luca Milanesio

May 25, 2021, 7:39:49 AM5/25/21
to Repo and Gerrit Discussion, Luca Milanesio, elzoc...@gmail.com
I believe there must be a bug in JGit on that: I actually have found the same issue in my own setup.
Going to dig a bit deeper and see where the problem is.

*It* should have automatically recognised the “stale file handle” and quarantined the pack file, but it didn’t do it in my case and, possibly, you have the same problem :-(

I found [1] after your comment on NFS sharing yesterday, and added trustFolderStat=false to my etc/jgit.config this morning before redirecting my local traffic back to the read-only secondary instance. Will see if that improves things.

It might, but most likely it won’t fix the stale file handle issue IMHO.

Luca.

elzoc...@gmail.com

May 25, 2021, 9:43:53 AM5/25/21
to Repo and Gerrit Discussion
Just to understand: would a quarantine here mean moving the pack file somewhere else? And if so, who is *it*?
Asking because my secondary instance has its git folder (/gerrit/shared/git) mounted as read-only, so any "move" would be impossible.
 
I found [1] after your comment on NFS sharing yesterday, and added trustFolderStat=false to my etc/jgit.config this morning before redirecting my local traffic back to the read-only secondary instance. Will see if that improves things.

It might, but most likely it won’t fix the stale file handle issue IMHO.

Haven't seen any stale file handle in the past 5 hours, but I guess it is more likely to happen after the primary's nightly Gerrit git gc runs. Will continue to monitor.
Note: it looks (from the Grafana graphs) like that setting takes its toll on the CPU / system load average.

Luca Milanesio

May 25, 2021, 2:27:47 PM5/25/21
to Repo and Gerrit Discussion, Luca Milanesio
Quarantine = remove from the in-memory pack list for processing.
It = JGit

Asking because my secondary instance has its git folder (/gerrit/shared/git) mounted as read-only, so any "move" would be impossible.

That isn’t needed; the pack file should just be ignored by JGit, and *IT WAS* in the past … possibly something broke recently :-(

Luca.

Luca Milanesio

May 25, 2021, 7:32:42 PM5/25/21
to Repo and Gerrit Discussion, Luca Milanesio
I’ve reproduced the issue and filed a bug on JGit:

Luca.

Luca Milanesio

May 25, 2021, 8:13:34 PM5/25/21
to Repo and Gerrit Discussion, Luca Milanesio
… and I am working on a fix :-)
It is most likely a regression in JGit: I remember it was working fine and was able to silently manage the “stale file handle” exception in the past.

Luca.

elzoc...@gmail.com

May 26, 2021, 8:08:10 AM5/26/21
to Repo and Gerrit Discussion
OK thanks for the info.

I’ve reproduced the issue and filed a bug on JGit:

… and I am working on a fix :-)
It is most likely a regression in JGit: I remember it was working fine and was able to silently manage the “stale file handle” exception in the past.


tks :)

elzoc...@gmail.com

Jun 15, 2021, 9:58:29 AM6/15/21
to Repo and Gerrit Discussion
Question: is there a way to flush that in-memory pack list (other than restarting the replicas)?
As a workaround, I've switched our GC to once a week on Sundays. That way I can monitor the GC and restart the replicas immediately afterwards, but it's not really maintainable in the long term,
and I'm not sure GC-ing once a week is that good either.
Is there a better way to work with replicas?
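
A sketch of the weekly GC schedule described above, assuming Gerrit's built-in `gc.startTime` and `gc.interval` settings in `etc/gerrit.config` are what drives the GC. The 02:00 start time and the helper name `set_weekly_gc` are assumptions of this example, not values from this thread.

```shell
#!/bin/sh
# Sketch only: schedule Gerrit's GC once a week in <site>/etc/gerrit.config.
# "Sun 02:00" is an assumed start time; pass another one as the second argument.
set_weekly_gc() {
  site="$1"
  start="${2:-Sun 02:00}"
  cfg="$site/etc/gerrit.config"
  mkdir -p "$site/etc"
  # gc.startTime takes "<day of week> <hours>:<minutes>", gc.interval a duration
  git config -f "$cfg" gc.startTime "$start"
  git config -f "$cfg" gc.interval "1 week"
  # Show the resulting gc settings
  git config -f "$cfg" --get-regexp '^gc\.'
}
```

With a fixed start time, the replica restart could in principle be scheduled shortly after the expected end of the GC, although the GC duration is not known in advance.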

Many thanks,

luca.mi...@gmail.com

Jun 15, 2021, 10:29:57 AM6/15/21
to elzoc...@gmail.com, Repo and Gerrit Discussion



On 15 Jun 2021, at 14:58, elzoc...@gmail.com <elzoc...@gmail.com> wrote:



Unfortunately not :-( the only way to reset the JGit cache is a JVM restart AFAIK.

@Matthias can you confirm?

Luca

Matthias Sohn

Jun 15, 2021, 3:59:09 PM6/15/21
to Luca Milanesio, elzoc...@gmail.com, Repo and Gerrit Discussion
you can't reset the transient pack list from outside

elzoc...@gmail.com

Jun 17, 2021, 5:41:38 AM6/17/21
to Repo and Gerrit Discussion
OK, thanks for the info. What's the reliable way to get some replicas running? How do you get around that issue?
Via the replication plugin, to replicas with their own git filesystem? (Probably not; I think that would add some massive load, with thousands of projects and multiple replicas.)

Thanks again,
 Cedric.

Luca Milanesio

Jun 17, 2021, 6:09:31 AM6/17/21
to elzoc...@gmail.com, Luca Milanesio, Repo and Gerrit Discussion
I know it is not ideal, but unfortunately it is the only solution for now, unless you have lots of disk space and you always preserve the old packfiles during GC (--preserve-oldpacks and --prune-preserved).
I am working on a solution on the JGit code-base and I got some initial feedback.
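
A sketch of what running GC with those two options could look like, assuming the JGit command-line launcher is installed as `jgit` (override with the hypothetical `JGIT` variable) and that all repositories live in `*.git` directories under one root such as /gerrit/shared/git. The helper name and the `DRY_RUN` switch are assumptions of this example; it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch only: run JGit GC with preserved old packs on every repository
# found under a git root. JGIT and DRY_RUN are assumptions of this example.
gc_preserving_old_packs() {
  git_root="$1"
  jgit_cmd="${JGIT:-jgit}"
  # -prune stops find from descending into each repository once matched
  find "$git_root" -type d -name '*.git' -prune | sort |
  while IFS= read -r repo; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Dry run: only print the command that would be executed
      echo "$jgit_cmd --git-dir $repo gc --preserve-oldpacks --prune-preserved"
    else
      "$jgit_cmd" --git-dir "$repo" gc --preserve-oldpacks --prune-preserved
    fi
  done
}
```

For example, `gc_preserving_old_packs /gerrit/shared/git` lists the commands, and `DRY_RUN=0 gc_preserving_old_packs /gerrit/shared/git` would run them on the node that has the git folder mounted read-write. As noted above, preserving old packs costs extra disk space.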

Luca.

elzoc...@gmail.com

Aug 2, 2021, 2:40:58 AM8/2/21
to Repo and Gerrit Discussion
Hi Luca,

I saw "JGit Issue 573791: Stale file handle raised when loading a collection of notes with a NoteMap over NFS" in the release notes for 3.3.5, so I upgraded last week.

Our weekly GC ran yesterday on our main server, Gerrit-1. I did a few tests and checks on our read-only (shared git FS) Gerrit-2 afterwards, and it looked OK, with only some expected "...xx.pack handle is stale, removing it from pack list" warnings.

During the night, though, we had some clone failures against Gerrit-2, with errors [1] on a few repos.

For the manual clones I did before restarting the server: no 5xx shows up in httpd_log [2], there is an error in error_log [1], and the git client fails [3].

Tks,



[1] error_log
[2021-08-02T06:57:58.139+01:00] [HTTP POST /r/test_component/git-upload-pack (N/A from 10.1.1.184)] WARN  org.eclipse.jetty.server.handler.ContextHandler.r : Internal error during upload-pack from /gerrit/shared/git/test_component.git
at com.google.gerrit.pgm.http.jetty.ProjectQoSFilter.doFilter(ProjectQoSFilter.java:183)
[2] httpd_log
10.1.1.184 [HTTP-205498] - - [2021-08-02T06:57:58.000+01:00] "GET /r/test_component/info/refs?service=git-upload-pack HTTP/1.1" 200 2237 22 - "git/2.24.3 (Apple Git-128)"
10.1.1.184 [HTTP-205754] - - [2021-08-02T06:57:58.140+01:00] "POST /r/test_component/git-upload-pack HTTP/1.1" 200 65604 17 - "git/2.24.3 (Apple Git-128)"

[3] git client
08.02 06:57:58 /tmp> git clone http://gerrit2:8081/r/test_component
Cloning into 'test_component'...
remote: Counting objects: 6, done
fatal: the remote end hung up unexpectedly
fatal: early EOFs:  68% (88/129)
fatal: index-pack failed
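
Since the POST in [2] is logged with status 200 even though the client in [3] fails, a probe that only watches HTTP status codes would miss this failure. A minimal sketch of a check that relies on the git client's exit status instead; the helper name `check_clone` is hypothetical, and it clones into a throwaway temporary directory.

```shell
#!/bin/sh
# Sketch only: report whether a full clone from the given URL succeeds,
# based on git's exit status rather than the HTTP status in httpd_log.
check_clone() {
  url="$1"
  workdir=$(mktemp -d) || return 2
  # --no-checkout still fetches the whole pack, which is what fails here
  if git clone --quiet --no-checkout "$url" "$workdir/probe" >/dev/null 2>&1; then
    status=0
  else
    status=1
  fi
  rm -rf "$workdir"
  return "$status"
}
```

Usage could be `check_clone http://gerrit2:8081/r/test_component || echo "clone from replica failed"`, for instance from a monitoring job.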

Luca Milanesio

Aug 2, 2021, 3:18:20 PM8/2/21
to elzoc...@gmail.com, Luca Milanesio, Repo and Gerrit Discussion
Yep, I had those also and @Ponch has developed a fix for that condition as well in [4].

Hope it will get merged soon :-)

Luca.

