Replication problems to gerrit slave server

Luthander, Fredrik

Oct 21, 2009, 4:13:49 AM
to repo-d...@googlegroups.com
Hello all Gerrit users!
 
My team and I run a configuration of Gerrit servers here at Sony Ericsson, and we've experienced a few problems lately. I'm not familiar with how to enable any debug logs, so any hints on how to enable them would be greatly appreciated.
 
We have one master gerrit server and several slave replication servers in different sites. This means that our users upload their changes, and also browse the gerrit web on the master server. The master gerrit service then transmits all changes to our replication servers (configured through replication.config), and our users replicate everything they need from these replication servers.
We haven't been able to configure Gerrit in slave mode on all our replication servers yet; only one runs Gerrit in slave mode, while the others run the regular ssh daemon.
 
So, on to the problem we're facing:
Several times a week the replication gets stuck. The symptom is that while a random .git is being pushed from the master server, that operation (from the perspective of the show-queue command) never seems to finish. Naturally, once we have one of these non-finishing replication jobs, the list behind that job just grows and changes start lagging behind on all of our gits. It only affects the machine that runs the Gerrit slave, not the regular ssh daemon servers. However, synchronization is performed through the regular ssh service. It is also the replication server with by far the highest load; the others have very few users right now.
 
So far we haven't been able to figure out how to empty the queue, or how to force Gerrit to abandon or restart the job it's currently working on. As of now we are forced to restart the entire Gerrit service on the master server. Any help on how we can get around the problem more conveniently is appreciated.
Also, we are willing to do what we can to extract information that would help solve this potential bug or find the problem in our environment. Unfortunately, as of today, nobody on my team is a very skilled programmer. We would therefore need fairly detailed instructions on how to increase logging or do anything else necessary to track the problem down.
 
I hope the problem has been described well enough and that someone has a theory about the error. Any hints on the matter are highly appreciated! :)

--
Best regards,
     Fredrik Luthander

Sony Ericsson
Mobile Communications
sonyericsson.com

Nasser Grainawi

Oct 21, 2009, 10:19:24 AM
to repo-d...@googlegroups.com

Hi Fredrik,

It'd be very helpful to see some info from your replication.config and to know
which version of Gerrit you're running.

Chances are all you need to do is set:
remote.<name>.timeout (described fully here:
http://gerrit.googlecode.com/svn/documentation/2.0/config-replication.html)

It defaults to waiting indefinitely, but it seems you're getting stuck with that.
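
For illustration, a minimal sketch of what setting this option in replication.config might look like. The remote name, host, and path are hypothetical, the value 30 is arbitrary, and the unit is assumed to be seconds; the linked documentation is the authority on the exact semantics.

```
# Hypothetical remote; only the "timeout" key is the point of this sketch.
[remote "site-a"]
  url = gerrit2@site-a.example.com:/srv/git/${name}.git
  # Assumed to be seconds; leaving it unset means waiting indefinitely.
  timeout = 30
```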

That said, we also do replication, but I've never seen it get stuck, even on a
host with a very high load.

Nasser

Shawn Pearce

Oct 22, 2009, 4:12:49 PM
to repo-d...@googlegroups.com

FYI, timeout is busted due to a nasty race condition bug inside the JSch client library we use for replication over ssh.

Unfortunately there also isn't a way to kill or restart the replication to a particular destination once it gets stuck.  Java doesn't really provide a great way to safely abort a running thread which is currently running code that we did not write (JSch) to be abortable.

I need to go back into JGit here and rework how timeout is implemented.  I (and several others according to mailing archives) tried reporting the issues to the JSch developers but they say it's working fine and don't want to fix it.

Best you can do right now is put each replication server into its own remote block.  IIRC this will create one job queue per remote and at least the other sites will remain current when the one site gets stuck.
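
As a rough sketch of that layout, a replication.config with one remote block per replication server could look something like the following. The remote names, hosts, and paths are made up for illustration; only the one-section-per-server structure is the point.

```
# One [remote] section per replication server, so that (per the note above)
# each destination gets its own job queue.
[remote "site-a"]
  url = gerrit2@site-a.example.com:/srv/git/${name}.git

[remote "site-b"]
  url = gerrit2@site-b.example.com:/srv/git/${name}.git
```

With this layout a push stuck towards site-a should, if the one-queue-per-remote recollection holds, leave replication to site-b unaffected.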

Fredrik Luthander

Oct 23, 2009, 12:06:58 PM
to Repo and Gerrit Discussion
Hi everyone, and thanks for your prompt support!

We're currently running 2.0.22 here, eagerly waiting for 2.0.24 any
day now. :-)

I'd like to thank Nasser for pointing us to the timeout option. We
tried it, but it didn't work very well, as Shawn pointed out.
We've had our fair share of Gerrit restarts during the day, hehe.
The replication config already has one section per server, so as suggested
replication only hangs for the server in question and not for all of them.
Can you have several threads but only one server in a section, and
thus have several threads sync to the same site? (I'm guessing no..)

Right now we're investigating other services on the misbehaving server.
It's only one server that has these problems, and we have not been
able to identify if there's a configuration or service on the machine
that is the cause of our problems. This is the current theory though,
so we'll try to disable services one by one as long as we see the
problems. Maybe we'll get lucky with that.

Again, any debug info I can extract for you I'll be happy to give you.
Our current fix is to have a script monitor the show-queue command and
then restart the service automatically as soon as the queue keeps filling up
and is not emptying as it should.
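
For anyone wanting to try a similar workaround, here is a minimal Python sketch of the "is the queue stuck?" decision only. It is not the script described above: it assumes you already have some way of turning show-queue output into a task count (not shown, since the output format may differ between versions), and the function name and `window` parameter are made up for this example.

```python
from typing import Sequence


def queue_looks_stuck(samples: Sequence[int], window: int = 5) -> bool:
    """Guess whether the replication queue is stuck.

    `samples` holds queue lengths taken at regular intervals, oldest first.
    The queue is treated as stuck when the last `window` samples are all
    non-empty and the length never shrank between consecutive samples.
    """
    if window < 2 or len(samples) < window:
        return False  # not enough history to judge
    recent = list(samples[-window:])
    if min(recent) == 0:
        return False  # the queue drained at some point, so it is moving
    # Non-decreasing across the whole window means nothing visibly completed.
    return all(later >= earlier for earlier, later in zip(recent, recent[1:]))


if __name__ == "__main__":
    print(queue_looks_stuck([1, 3, 6, 10, 15]))  # steadily growing queue
    print(queue_looks_stuck([9, 7, 4, 2, 1]))    # queue is draining
```

This is a heuristic: a busy queue that stays at the same length while different jobs pass through it would also be flagged, so a real watchdog would probably want a longer window or a check on which job is at the head of the queue before restarting anything.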

--
Best regards,
Fredrik Luthander

Sony Ericsson
Mobile Communications
sonyericsson.com

Shawn Pearce

Oct 23, 2009, 6:16:13 PM
to repo-d...@googlegroups.com

Check the replication docs, there is a thread parameter per remote that permits more than one project to replicate at a time.  Each thread works independently so a restart is only necessary once all threads are stuck.
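
A minimal sketch of what that might look like in replication.config, assuming the parameter is spelled `threads` as in the replication documentation. The remote name, host, path, and the value 4 are purely illustrative.

```
# Hypothetical remote with several worker threads pushing to the same site.
[remote "busy-site"]
  url = gerrit2@busy-site.example.com:/srv/git/${name}.git
  # Assumed spelling of the per-remote thread parameter; check the docs.
  threads = 4
```

With a setting like this, one hung push would tie up only one of the workers for that remote, which matches the point above that a restart is needed only once every thread is stuck.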
