Replication status page?

lucamilanesio

Jan 19, 2015, 8:36:10 PM

to repo-d...@googlegroups.com

Hi all,

This may seem a silly question (pardon me if it really is), but how do you typically check the status of your Gerrit slave replicas against the master repos?

Imagine that one slave goes offline for any reason, or falls too far behind the master: how do you detect the situation?

Do you have any automation script for that?

Is there any way to detect the problem from the Gerrit master?

Has anyone developed a plugin for showing a status page? (A red/amber/green (RAG) indicator on a per-node/repo basis.)

Thank you in advance for sharing your view :-)

Luca.

lucamilanesio

Jan 20, 2015, 10:16:04 AM

to repo-d...@googlegroups.com

I was thinking about a simple script to:

- for every repo on Gerrit master

  - for every replica of repo <repo> to remote <remote>

    - compare the "git ls-remote <repo> | sort | sha1sum" with "git ls-remote <remote repo> | sort | sha1sum"

If the remote replica is up-to-date, the two SHA-1 sums will be identical; otherwise the replica is not up-to-date.
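A minimal Python sketch of the comparison described above. The function names (`refs_digest`, `ls_remote`, `replica_status`) are hypothetical, and the digest is not byte-identical to what `sort | sha1sum` prints; it only needs to be computed the same way on both sides. A failing `git ls-remote` against the replica (for example, a missing repository) is reported separately, and "behind" here really means "the ref lists differ".

```python
import hashlib
import subprocess


def refs_digest(ls_remote_output):
    """SHA-1 over the sorted, non-empty lines of `git ls-remote` output."""
    lines = sorted(line for line in ls_remote_output.splitlines() if line.strip())
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


def ls_remote(url):
    """Run `git ls-remote <url>`; raises CalledProcessError if it fails."""
    return subprocess.check_output(["git", "ls-remote", url],
                                   universal_newlines=True)


def replica_status(master_url, replica_url):
    """Return 'up-to-date', 'behind' or 'unreachable' for one replica of one repo."""
    master = refs_digest(ls_remote(master_url))
    try:
        replica = refs_digest(ls_remote(replica_url))
    except subprocess.CalledProcessError:
        # Repo missing on the replica, or the replica is offline.
        return "unreachable"
    return "up-to-date" if master == replica else "behind"
```

Calling `replica_status(<master repo URL>, <remote repo URL>)` inside the two nested loops would give one status per repo/remote pair.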

Any comments on this? Am I missing anything? Has anyone developed a script or plugin to automate this?

Feedback is highly appreciated :-)

Luca.

Matthias Sohn

Jan 20, 2015, 11:22:54 AM

to lucamilanesio, Repo and Gerrit Discussion

I think you first need to check that all repositories have a replica.

-Matthias


Luca Milanesio

Jan 20, 2015, 11:59:26 AM

to Matthias Sohn, Repo and Gerrit Discussion

Thanks Matthias. If the repo doesn't exist on the replica, then I assume "git ls-remote <remote repo>" would fail.

In that case it is a RED flag :-)

Luca.

Martin Fick

Jan 20, 2015, 1:41:12 PM

to lucamilanesio, repo-d...@googlegroups.com

This is going to sound strange at first, but due to the number of refs that we have, it typically takes about the same amount of time to check a slave's status as it does to bring it up-to-date. This makes running a separate status mechanism not very useful. We generally just know how far behind a slave is by looking at the replication queue to see how many tasks it has. So the queue is the status report.

People have often asked us internally "how can we tell when a slave is behind?", and my answer is typically "almost always", because our system has enough uploads that it is always replicating. So it would be fairly difficult to even try to give useful information beyond the number of waiting tasks for a slave.
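As a rough illustration of "the queue is the status report", the Python sketch below counts pending replication tasks per destination host from the text printed by `gerrit show-queue`. The parsing is an assumption: the output layout differs between Gerrit versions, so the code only looks for the word `push` followed by a URL on each line. `pending_by_host` is a hypothetical helper, not part of Gerrit.

```python
import re
from collections import Counter

# Assumption: every replication task line contains "push <url>".
PUSH_RE = re.compile(r"\bpush\s+(\S+)")
# Host part of either scheme://[user@]host[:port]/path or [user@]host:path.
HOST_RE = re.compile(r"^(?:[a-z+]+://)?(?:[^@/]+@)?([^:/]+)")


def pending_by_host(show_queue_output):
    """Count queued push tasks per destination host."""
    counts = Counter()
    for line in show_queue_output.splitlines():
        push = PUSH_RE.search(line)
        if not push:
            continue  # header, separator, or a non-replication task
        host = HOST_RE.match(push.group(1))
        if host:
            counts[host.group(1)] += 1
    return counts
```

The input would come from something like `ssh -p 29418 <gerrit-username>@<gerrit-server> gerrit show-queue -w`; the `-w` (wide) option keeps long URLs from being truncated. A count that keeps growing for one host would suggest that slave is falling behind.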

The only useful status check we have had some success with is checking a slave for specific project/ref combos to see if it is ready for a specific build.

-Martin

lucamilanesio

Jan 20, 2015, 1:57:36 PM

to repo-d...@googlegroups.com, luca.mi...@gmail.com

Hi Martin,

in a nutshell, you are saying that the time taken by the ls-remote command would make the overall check quite ineffective: by the time the remote command finishes, the result would already be obsolete :-(

I need to check the actual time the replicas take to answer the "ls-remote" query, and whether that approach would work for us or not.

The "project/ref" combination is a good compromise indeed: if you do know that a branch (e.g. master) is very likely to be updated on a daily basis, you just check *that* combination and get a quick response. Thanks for that suggestion :-)

Other feedback / experience on that? :-)

I remember the SonyMobile guys had lots of remote replicas ... how do you make sure that they are up-to-date other than checking the Gerrit queue?

Luca.

Jan Kundrát

Jan 20, 2015, 2:15:08 PM

to repo-d...@googlegroups.com

On Tuesday, 20 January 2015 04:51:51 CEST, Martin Fick wrote:
> We generally just know how far behind a slave is by looking at
> the replication queue to see how many tasks it has.

Hi Martin,
how do you check this queue? Some of my refs were denied replication due to
the server explicitly rejecting them ("reason: hook declined"), and I don't see
these refs in `gerrit show-queue`. I do see some tasks in there when the
replication is pending.

Cheers,
Jan

--
Trojitá, a fast Qt IMAP e-mail client -- http://trojita.flaska.net/

Martin Fick

Jan 20, 2015, 2:53:25 PM

to "Jan Kundrát", repo-d...@googlegroups.com

> On Tuesday, 20 January 2015 04:51:51 CEST, Martin Fick wrote:
>> We generally just know how far behind a slave is by looking at
>> the replication queue to see how many tasks it has.
>
> Hi Martin,
> how do you check this queue? Some of my refs were denied replication due
> to server explicitly rejecting them ("reason: hook declined"), and I don't
> see these refs in `gerrit show-queue`. I do see some tasks in there when
> the replication is pending.

They may be failing with some error that does not cause them to reschedule,
or you may not have rescheduling enabled. For cases like this, you likely
need to check your log file.

-Martin

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

Martin Fick

Jan 20, 2015, 3:05:34 PM

to lucamilanesio, repo-d...@googlegroups.com, luca.mi...@gmail.com

> Hi Martin,
>
> in a nutshell you are saying that the ls-remote command execution would
> just make the overall check quite ineffective: the time you wait for the
> remote command to finish ... the result would be already obsolete :-(

Somewhat, yes. But also, if you are going to take that time to check, the
extra cost to actually bring the slave up-to-date is generally tiny, so
why bother doing just the checking? It is typically more effective to
always replicate. Also, depending on where the slave is, the slowness is
not necessarily due to ls-remote executing, but rather because the data
it outputs needs to go over the WAN (tons of refs, tons of data from
ls-remote). So the slaves which are most likely to lag are the slaves
which do not have the WAN bandwidth to spare to be able to run extra
checks against!

> I need to check if the actual time for the replicas to answer the
> "ls-remote" query and if that approach would work for us or not.
>
> The "project/ref" combination is a good compromise indeed: if you do know
> that a branch (e.g. master) is very likely to be updated on a daily basis,
> you just check *that* combination and get a quick response. Thanks for
> that suggestion :-)

Checking a single branch requires a custom script that accesses the remote
repo directly (likely over SSH on port 22) rather than using ls-remote. So
either you save time because you are doing that, or because you only run
the check (even if using ls-remote) when you need the answer (not regularly,
since that would just make the replication system slower). Either way, this
works as a solution to make sure you have what you need for your build, but
it is not a generic monitoring solution which can give you a "replication
status view" of your slave.
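One way such a direct check could look, as a sketch only: it logs in to the slave over plain SSH and resolves a single ref inside the bare repository with `git rev-parse`, so no full ref advertisement crosses the WAN. The host name, repository path, and helper names (`rev_parse_command`, `slave_has_revision`) are hypothetical, and the sketch assumes shell access to the slave and Python 3.5 or later.

```python
import shlex
import subprocess


def rev_parse_command(ssh_host, git_dir, ref):
    """Build the ssh command that resolves one ref in a bare repo on the slave."""
    remote = "git --git-dir={} rev-parse --verify --quiet {}".format(
        shlex.quote(git_dir), shlex.quote(ref))
    return ["ssh", ssh_host, remote]


def slave_has_revision(ssh_host, git_dir, ref, expected_sha1):
    """True if `ref` on the slave already points at `expected_sha1`."""
    proc = subprocess.run(rev_parse_command(ssh_host, git_dir, ref),
                          stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                          universal_newlines=True)
    # rev-parse exits non-zero when the ref does not exist on the slave yet.
    return proc.returncode == 0 and proc.stdout.strip() == expected_sha1
```

A build job could call `slave_has_revision(...)` for the one project/ref it is about to fetch and wait or fall back to the master if it returns False.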

Sven Selberg

Jan 22, 2015, 8:22:03 AM

to repo-d...@googlegroups.com, luca.mi...@gmail.com

> I remember the SonyMobile guys had lots of remote replicas ... how do you make sure that they are up-to-date other than checking the Gerrit queue?

Hi Luca,

Sorry for the late reply.

Yes, checking the Gerrit queue is how we monitor the replication from the master side.

I know there are monitoring scripts that listen to stream-events and, after a set period of time, check whether the slaves are up to date, and warn if they are not.

Luca Milanesio

Jan 22, 2015, 4:23:30 PM

to Sven Selberg, repo-d...@googlegroups.com

Hi Sven,

thanks for the feedback.

Using stream events is a good idea indeed :-)

Instead of checking all remote refs (which, as Martin mentioned, may take too long and be ineffective), knowing for sure what has changed and checking only *that* specific update would be much better.
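To make "checking only that specific update" concrete, here is a small Python sketch that turns one line of `gerrit stream-events` output into the (project, ref, SHA-1) triple a checker would need. The field names follow the stream-events JSON as I understand it (`refUpdate.newRev`, `patchSet.ref`, and so on) and should be verified against your Gerrit version; `update_from_event` is a hypothetical helper.

```python
import json


def update_from_event(line):
    """Return (project, ref, sha1) for events that move a ref, else None."""
    try:
        event = json.loads(line)
    except ValueError:
        return None  # not a JSON line
    if not isinstance(event, dict):
        return None
    kind = event.get("type")
    if kind == "ref-updated":
        update = event["refUpdate"]
        return update["project"], update["refName"], update["newRev"]
    if kind == "patchset-created":
        patch_set = event["patchSet"]
        return event["change"]["project"], patch_set["ref"], patch_set["revision"]
    return None  # comments, votes, etc. do not change any ref
```

Each returned triple names exactly one thing to look for on a replica, which avoids listing every ref.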

Do you know if any of those scripts have been published somewhere?

Thank you in advance for your feedback.

Luca.

Sven Selberg

Jan 23, 2015, 2:46:36 AM

to repo-d...@googlegroups.com, sven.s...@sonymobile.com

The scripts are approx five years old, so I don't think they are usable as-is. They date back to the days when we had big problems with synchronization of slaves, so they aren't running at the moment.

I'm guessing it would take as long to rewrite them as to find them :-) but I'll investigate...

/Sven

Sven Selberg

Jan 23, 2015, 3:39:27 AM

to repo-d...@googlegroups.com, sven.s...@sonymobile.com

Here's the short version of the scripts. I tried to cut them down to basics.

From each slave...

/Sven

##########################
### check_ps_replication.py ###
##########################

import datetime
import json
import subprocess
import time

# CONFIGURATION
gerrit_user = '<gerrit-username>@<gerrit-server>'
check_ps_cmd = 'PATH/TO/check_patchset.sh'

# CONSTANTS
gerrit_cmd = ['ssh', '-p', '29418', gerrit_user, 'gerrit', 'stream-events']


def check_replication_time(timestamp, patchset_sha1, patchset_project, patchset_branch):
    # Start the checker in the background; it polls until the patch set arrives.
    subprocess.Popen([check_ps_cmd, timestamp, patchset_sha1,
                      patchset_project, patchset_branch])


def process_patchset_created(json_data):
    # Seconds since the epoch, passed to the shell script as a string.
    capture_timestamp = str(int(time.time()))
    check_replication_time(capture_timestamp,
                           json_data["patchSet"]["revision"],
                           json_data["change"]["project"],
                           json_data["change"]["branch"])


# MAIN FUNCTION
while True:
    try:
        start_time = datetime.datetime.now()
        print('%s: Waiting for stream event' % start_time)
        input_stream = subprocess.Popen(gerrit_cmd, shell=False,
                                        stdout=subprocess.PIPE,
                                        universal_newlines=True)
        # readline returns '' once the SSH connection is closed.
        for line in iter(input_stream.stdout.readline, ''):
            # Parse each event and dispatch on its "type" field.
            event = json.loads(line)
            if event.get('type') == 'patchset-created':
                process_patchset_created(event)
    finally:
        abort_time = datetime.datetime.now()
        print('%s: Connection dropped. Waiting 15 seconds...' % abort_time)
        time.sleep(15.0)

##########################
### /check_ps_replication.py ###
##########################

#####################
### check_patchset.sh ###
#####################

#!/bin/bash

# CONSTANTS
GIT_URL=/PATH/TO/GERRIT_SITE/git
MAX_PERIOD=3600   # give up after this many one-second polls

# PARAMETERS
CAPTURE_TIMESTAMP=$1
PATCH_SHA1=$2
PATCH_PROJECT=$3
PATCH_BRANCH=$4   # passed in by the Python script, not used below

# MAIN
if [ ! -d "$GIT_URL/$PATCH_PROJECT.git" ]
then
    echo "Repository Not Found" "$PATCH_PROJECT"
    exit 1
fi

cd "$GIT_URL/$PATCH_PROJECT.git" || exit 1

counter=0
while [ "$counter" -lt "$MAX_PERIOD" ]
do
    counter=$((counter + 1))

    # Empty output means the object has not reached this slave yet.
    patch_type=$(git cat-file -t "$PATCH_SHA1" 2>/dev/null)
    if [ "$patch_type" = "commit" ]
    then
        # replicated now.
        SYNCED_TIMESTAMP=$(date +%s)
        REP_PERIOD=$((SYNCED_TIMESTAMP - CAPTURE_TIMESTAMP))
        echo "Yay! it took $REP_PERIOD seconds"
        exit 0
    fi

    # not replicated yet.
    echo "$CAPTURE_TIMESTAMP : Sleeping ... $counter"
    sleep 1
done

echo "Nay, not replicated within $MAX_PERIOD seconds."
exit 1

#####################
### /check_patchset.sh ###
#####################

Sven Selberg

Jan 23, 2015, 3:41:29 AM

to repo-d...@googlegroups.com, sven.s...@sonymobile.com

| so they aren't running at the moment.

My bad, they are still running...

/Sven
