Replication failures to git mirrors over a WAN

Anthony Russello

Jul 26, 2010, 11:37:12 PM

to Repo and Gerrit Discussion

Hi,

We have several git mirrors that Gerrit is configured to push to
across the WAN throughout the US and Asia. We are frequently seeing
messages such as this:

[2010-07-26 22:00:44,195] ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerrit2@<ip>:/home/gerrit2/repositories/Test.git
org.eclipse.jgit.errors.TransportException: Read timed out
    at org.eclipse.jgit.transport.BasePackConnection.readAdvertisedRefs(BasePackConnection.java:148)
    at org.eclipse.jgit.transport.TransportGitSsh$SshPushConnection.<init>(TransportGitSsh.java:365)
    at org.eclipse.jgit.transport.TransportGitSsh.openPush(TransportGitSsh.java:97)
    at org.eclipse.jgit.transport.PushProcess.execute(PushProcess.java:119)
    at org.eclipse.jgit.transport.Transport.push(Transport.java:866)
    at com.google.gerrit.server.git.PushOp.pushVia(PushOp.java:193)
    at com.google.gerrit.server.git.PushOp.runImpl(PushOp.java:146)
    at com.google.gerrit.server.git.PushOp.run(PushOp.java:102)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
    at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:310)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.InterruptedIOException: Read timed out
    at org.eclipse.jgit.util.io.TimeoutInputStream.readTimedOut(TimeoutInputStream.java:131)
    at org.eclipse.jgit.util.io.TimeoutInputStream.read(TimeoutInputStream.java:104)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at org.eclipse.jgit.util.IO.readFully(IO.java:122)
    at org.eclipse.jgit.transport.PacketLineIn.readLength(PacketLineIn.java:120)
    at org.eclipse.jgit.transport.PacketLineIn.readString(PacketLineIn.java:92)
    at org.eclipse.jgit.transport.BasePackConnection.readAdvertisedRefsImpl(BasePackConnection.java:161)
    at org.eclipse.jgit.transport.BasePackConnection.readAdvertisedRefs(BasePackConnection.java:142)
    ... 16 more

These errors occur randomly.

We have tried a great many tuning options, but I was hoping for
some guidance on tuning the git mirrors to prevent these issues from
occurring. As it stands, we have an additional scheduled rsync that
runs every 6 hours to catch the failures on the mirror sites, but this
often leaves us with gaps.

We have one mirror located in Asia that replicates flawlessly, but
another one drops occasionally. Another mirror in the US drops very
frequently. While we understand that a lot of this could be
attributed to network issues, is there any way to configure Gerrit,
or the mirrors, to retry and hopefully provide a lower rate of
replication failures?

Any assistance or guidance would be greatly appreciated.

Thanks,
Anthony

Shawn Pearce

Jul 27, 2010, 10:12:44 AM

to Anthony Russello, Repo and Gerrit Discussion

On Mon, Jul 26, 2010 at 20:37, Anthony Russello <arus...@gmail.com> wrote:
> We have several git mirrors that Gerrit is configured to push to
> across the WAN throughout the US and Asia. We are frequently seeing
> messages such as this:
>
> [2010-07-26 22:00:44,195] ERROR com.google.gerrit.server.git.PushReplication : Cannot replicate to gerrit2@<ip>:/home/gerrit2/repositories/Test.git
> org.eclipse.jgit.errors.TransportException: Read timed out
>     at org.eclipse.jgit.transport.BasePackConnection.readAdvertisedRefs(BasePackConnection.java:148)

Hmm. That's the SSH connection stalling and not supplying all of the
data during the initial connection. It's built on TCP and should be
retrying delivery of packets until it gets through, but that doesn't
mean TCP hasn't had to back off transmission so much that sending
the bits by ship and/or pony express would be faster. :-)

> We have tried a great many tuning options, but I was hoping for
> some guidance on tuning the git mirrors to prevent these issues from
> occurring. As it stands, we have an additional scheduled rsync that
> runs every 6 hours to catch the failures on the mirror sites, but this
> often leaves us with gaps.
>
> We have one mirror located in Asia that replicates flawlessly, but
> another one drops occasionally. Another mirror in the US drops very
> frequently. While we understand that a lot of this could be
> attributed to network issues, is there any way to configure Gerrit,
> or the mirrors, to retry and hopefully provide a lower rate of
> replication failures?

Unfortunately Gerrit doesn't automatically retry replication when it
fails. We really should do that. In the meantime you can schedule a
cron script on your Gerrit server to force it to recheck replication
every so often:

ssh -p 29418 localhost gerrit replicate --all

This just schedules every project for replication, so it runs fairly
quickly, but its effects can take some time to be seen on the mirrors.
If a project is already scheduled for replication, the command will
automatically coalesce the requests together. So it's relatively
painless to run: even if the requests from the prior cron run
haven't started yet, the later one will just combine with them and run
as soon as possible. Maybe run it every 10 minutes to make sure the
mirrors stay in sync?
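
A minimal sketch of such a cron job, written in Python rather than shell so that failures are easy to report. The host, port, and `gerrit replicate --all` command come from the message above; the function names and the 60-second timeout are assumptions made for illustration only.

```python
import subprocess
import sys


def build_replicate_command(host="localhost", port=29418):
    """Return the ssh command that asks Gerrit to schedule every project."""
    return ["ssh", "-p", str(port), host, "gerrit", "replicate", "--all"]


def trigger_replication(runner=subprocess.run, timeout_s=60):
    """Run the command once; return True if ssh exited with status 0.

    timeout_s is an assumed safety limit so a hung ssh session cannot
    pile up across cron runs. It is not a Gerrit setting.
    """
    try:
        result = runner(build_replicate_command(), timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError) as exc:
        print(f"replicate trigger failed: {exc}", file=sys.stderr)
        return False
    return result.returncode == 0


def main():
    """Entry point for cron: exit non-zero so cron can report a failure."""
    sys.exit(0 if trigger_replication() else 1)
```

Calling `main()` from a script scheduled every 10 minutes matches the interval suggested above. Because already-scheduled requests are coalesced, overlapping runs should be harmless.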

The relevant bug is already open, issue 482 [1] if you want to star it.

[1] http://code.google.com/p/gerrit/issues/detail?id=482

Anthony Russello

Jul 27, 2010, 10:50:11 AM

to Shawn Pearce, Repo and Gerrit Discussion

Hi Shawn,

First, thanks for replying.

> Hmm. That's the SSH connection stalling and not supplying all of the
> data during the initial connection. It's built on TCP and should be
> retrying delivery of packets until it gets through, but that doesn't
> mean TCP hasn't had to back off transmission so much that sending
> the bits by ship and/or pony express would be faster. :-)

Pony express would be much faster actually :(

We see it happen randomly, and have had to configure timeouts for the
replication. Re-running replication will often then halt at the same
repository, or sometimes a completely different one.

> Unfortunately Gerrit doesn't automatically retry replication when it
> fails. We really should do that. In the meantime you can schedule a
> cron script on your Gerrit server to force it to recheck replication
> every so often:
>
> ssh -p 29418 localhost gerrit replicate --all

I'll try this for a few days, and see how it works out.

> This just schedules every project for replication, so it runs fairly
> quickly, but its effects can take some time to be seen on the mirrors.
> If a project is already scheduled for replication, the command will
> automatically coalesce the requests together. So it's relatively
> painless to run: even if the requests from the prior cron run
> haven't started yet, the later one will just combine with them and run
> as soon as possible. Maybe run it every 10 minutes to make sure the
> mirrors stay in sync?

I didn't know that several requests would coalesce. That's handy
information. At least it won't be repeating in less than ten minutes'
time.

I will see how it goes with that. Hopefully the extra traffic it
generates over the WAN won't annoy IT.

> The relevant bug is already open, issue 482 [1] if you want to star it.
>
> [1] http://code.google.com/p/gerrit/issues/detail?id=482

Consider it as good as starred.

Any thoughts on tuning the Ethernet adapter to improve this at all?

Thanks,
Anthony

Shawn Pearce

Jul 27, 2010, 11:03:32 AM

to Anthony Russello, Repo and Gerrit Discussion

On Tue, Jul 27, 2010 at 07:50, Anthony Russello <arus...@gmail.com> wrote:
> We see it happen randomly, and have had to configure timeouts for the
> replication. Re-running replication will often then halt at the same
> repository, or sometimes a completely different one.

Yuck. The timeout code is actually known to be buggy [2]. JSch has
some race conditions that present themselves when the timeout is
enabled. Digging around its code doesn't leave me with a lot of
confidence in the author's ability to write multi-threaded code in
Java. :-( I'm trying to move JGit away from JSch, but I haven't had
the time to do that work, or to finish the MINA SSHD client code to a
sufficient level that we can rely on it for replication.

[2] http://code.google.com/p/gerrit/issues/detail?id=232

> Any thoughts on tuning the ethernet adapter to improve this at all?

None. By the time you reach the Ethernet adapter, there isn't much
left to tune if the higher level stuff is seriously broken. I'm not a
networking expert, but it seems to me that if there is a lot of packet
loss, using a lower MTU at the TCP layer might make things more
reliable. It would cause TCP to slice the send window into a larger
number of smaller packets, each of which is more likely to fit into a
single packet transiting the network without fragmentation. When a
packet is dropped, TCP resend would be able to retransmit just that one
packet until it's successful. If the MTU is really big, a single TCP
packet might be broken into 4 smaller packets to fit on the network.
If any of those 4 is dropped, the entire series of 4 packets has to be
resent by TCP in order to recover. If packet loss is 25% on the WAN
link (e.g. due to already high utilization and the link needing to
discard packets when buffers fill in routers), the odds are very
good that you will never get that TCP packet across the link, and the
whole TCP stream just stalls.
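
A rough sketch of the arithmetic behind this, assuming each fragment is lost independently at the stated 25% rate. That is a simplification, since real WAN loss tends to come in bursts. The function names are made up for illustration.

```python
def delivery_probability(loss_rate, fragments):
    """Chance that every fragment of one TCP segment arrives,
    assuming independent loss per fragment (a simplification)."""
    return (1.0 - loss_rate) ** fragments


def expected_attempts(loss_rate, fragments):
    """Average number of sends until one attempt gets all fragments through."""
    return 1.0 / delivery_probability(loss_rate, fragments)


for n in (1, 4):
    p = delivery_probability(0.25, n)
    tries = expected_attempts(0.25, n)
    print(f"{n} fragment(s): {p:.3f} per attempt, ~{tries:.2f} attempts on average")
```

Under these assumptions an unfragmented segment gets through 75% of the time, while a segment split into 4 fragments gets through only about 32% of the time. Each retry also waits out TCP's growing retransmission timer, which is how a lossy link can push a connection past the replication read timeout.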

IIRC way back when we had PPP and SLIP on dial-up modems it was a good
idea to set your MTU to around the PPP or SLIP sizes. I think it's
still the same with WAN links, especially those that are very busy.
But like I said, I'm not a networking expert. I just sometimes play
one on the Internet. :-)
