Running multiple Gerrit instances against the same ReviewDB/Git FS


AlBlue

Aug 10, 2011, 4:45:37 AM
to Repo and Gerrit Discussion
I'm looking into the practicality of running multiple Git daemons (on
different hosts) pointing to the same ReviewDB and back-end Git
repositories, as a means of load-balancing between multiple clients.

Whilst most data is immutable in the ReviewDB, and Gerrit appears to
do a lot of reloading of the database's state (e.g. to determine if
another program has run any 'gerrit review' flags), I'm concerned
that there might be some data which isn't reloaded on demand from
the database.

The one thing I can think of is the change set number (which is used
to calculate the next branch to push to) as this is monotonically
increasing from one review push to the next. As well as race
conditions between updating that number, is there anything else to be
concerned about? Are the row index values calculated via some kind of
automatic integer in the database, or is it a client-side operation?

I'd be happy to help contribute changes necessary to try and remove
any contentious points if there were any. For example, instead of
using a gerrit-wide branch number, perhaps it could be stored in a per-
project .git/config entry.
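Purely as a sketch of what I mean (the section and key names here are
invented for illustration, not an existing Gerrit setting), such a
per-project entry in .git/config might look like:

```ini
[gerrit]
	# Hypothetical per-project counter; not a real Gerrit key today.
	nextChangeNumber = 4321
```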

Alex

Magnus Bäck

Aug 10, 2011, 1:06:52 PM
to Repo and Gerrit Discussion
On Wednesday, August 10, 2011 at 10:45 CEST,
AlBlue <alex.b...@gmail.com> wrote:

> I'm looking into the practicality of running multiple Git daemons
> (on different hosts) pointing to the same ReviewDB and back-end Git
> repositories, as a means of load-balancing between multiple clients.

Do you mean balancing the read access of the repositories?

> Whilst most data is immutable in the ReviewDB, and Gerrit appears to
> do a lot of reloading of the database's state (e.g. to determine if
> another program has run any 'gerrit review' flags), I'm concerned
> whether there might be some data which isn't reloaded upon demand
> from the database.

Gerrit is currently not ready for multi-master operation. You need to
have a single master server, but that master can send data to any number
of slave servers to offload the traffic from clients that fetch data.

[...]

--
Magnus Bäck                     Opinions are my own and do not necessarily
SW Configuration Manager        represent the ones of my employer, etc.
Sony Ericsson

Shawn Pearce

Aug 10, 2011, 1:10:12 PM
to AlBlue, Repo and Gerrit Discussion
On Wed, Aug 10, 2011 at 01:45, AlBlue <alex.b...@gmail.com> wrote:
> I'm looking into the practicality of running multiple Git daemons (on
> different hosts) pointing to the same ReviewDB and back-end Git
> repositories, as a means of load-balancing between multiple clients.
>
> Whilst most data is immutable in the ReviewDB, and Gerrit appears to
> do a lot of reloading of the database's state (e.g. to determine if
> another program has run any 'gerrit review' flags), I'm concerned
> whether there might be some data which isn't reloaded upon demand from
> the database.

There is a *lot* of data cached in memory in the Gerrit server. The
entire "accounts", "account_group_members", "account_external_ids",
"account_ssh_keys", etc. tables are held in memory by the server and
hardly ever read from the database. (Actually it's a cache, but assuming
your cache is larger than the active set, it's read once and never read
again.) In Gerrit 2.1.x the "projects" and "ref_rights" tables are also
cached, holding the access control data. In Gerrit 2.2.x the
"refs/meta/config" branch is parsed and held cached until the server
notices the branch has changed, which it checks for every few minutes.
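If cache sizing matters for your setup, individual caches can be tuned
in gerrit.config; something like the following (the cache names and
values here are illustrative, check the configuration documentation for
your version):

```ini
[cache "accounts"]
	memoryLimit = 1024
[cache "projects"]
	maxAge = 5 minutes
```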

> The one thing I can think of is the change set number (which is used
> to calculate the next branch to push to) as this is monotonically
> increasing from one review push to the next.

This is incremented by the database, not the Gerrit server. So it's safe.

What isn't safe is the processing of the submit queue. When a change
gets submitted Gerrit puts the change into the submit queue in memory.
That queue is then run essentially single-threaded to update the Git
branch and mark the change as merged. If the server goes down and
comes back up, the submit queue is reloaded from the database and
processed by the server. If you have N servers, and they all restart
at the same time (e.g. power hiccup and they all restart), the submit
queue will be processed N times concurrently with lots of race
conditions and failures.

In theory this stuff should be safe; the Git low-level locking will make
sure the branch isn't corrupted, but you will probably see a lot of
false lock errors in the error logs due to lock contention, and you
may see changes get stuck in the Submitted state even though they are
actually Merged, because the lock contention caused one server to succeed
and write Merged, and then the other server to fail and try to write
Submitted back to the database to support retrying later.

Basically I didn't write the code to support this use case, and it's
never been tested like this, so I wouldn't suggest running it that way
in production right now.

> I'd be happy to help contribute changes necessary to try and remove
> any contentious points if there were any.

If you want to load-balance the Git-over-SSH or Git-over-HTTP
operations, you can use slave servers. Check the daemon --slave flag
in the documentation. The slaves can use the same Git repository and
database. All web UI and write operations have to go through the
master, but Git reads can be sent through the slaves. If you run the
slave caches with a shorter maxAge (e.g. 1 hour rather than 90 days)
users who change their SSH keys will have less time to wait for a
slave to pick up their new key.
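For example, something like this in the slave's gerrit.config (the
values are illustrative):

```ini
[cache "sshkeys"]
	maxAge = 1 hour
[cache "accounts"]
	maxAge = 1 hour
```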

http://gerrit.googlecode.com/svn/documentation/2.2.0/pgm-daemon.html

Alex Blewitt

Aug 10, 2011, 3:04:14 PM
to Shawn Pearce, Repo and Gerrit Discussion
On 10 Aug 2011, at 18:10, Shawn Pearce wrote:

> On Wed, Aug 10, 2011 at 01:45, AlBlue <alex.b...@gmail.com> wrote:
> > I'm looking into the practicality of running multiple Git daemons (on
> > different hosts) pointing to the same ReviewDB and back-end Git
> > repositories, as a means of load-balancing between multiple clients.
>
> There is a *lot* of data cached in memory in the Gerrit server. The
> entire "accounts", "account_group_members", "account_external_ids",
> "account_ssh_keys", etc. tables are held in memory by the server and
> hardly read from the database.

I suspect that would be fine; new accounts don't get created too frequently
in any case, and the database can be pre-loaded with external ids, if
nothing else. Perhaps having the option to reload this content periodically
(as for the slave, but for the master) would be a way of picking up more
frequent updates; but I'd expect that to be every day rather than more
frequently.

> > The one thing I can think of is the change set number (which is used
> > to calculate the next branch to push to) as this is monotonically
> > increasing from one review push to the next.
>
> This is incremented by the database, not the Gerrit server. So it's safe.

Good to know, thanks.

> What isn't safe is the processing of the submit queue.
[...]
> ...and you may see changes get stuck in Submitted state even though they are
> actually Merged because the lock contention caused 1 server to succeed
> and write Merged, and then the other server to fail and try to write
> Submitted back to the database to support retrying later.
>
> Basically I didn't write the code to support this use case, it's never
> been tested like this, so I wouldn't suggest running it that way in
> production right now.

Good advice. 

Is this something I can take a look at to see if I can help? Looking at the
code, it looks like ReloadSubmitQueueOp.java is the place to start, 
along with ChangeMergeQueue.java.

> > I'd be happy to help contribute changes necessary to try and remove
> > any contentious points if there were any.
>
> If you want to load-balance the Git-over-SSH or Git-over-HTTP
> operations, you can use slave servers.

Thanks, that's also good to know. I suspect read-only replicas will benefit from
this, but ideally I'd like to have live-live Gerrit writable instances as well.

Alex

Shawn Pearce

Aug 11, 2011, 1:23:44 PM
to Alex Blewitt, Repo and Gerrit Discussion
On Wed, Aug 10, 2011 at 12:04, Alex Blewitt <alex.b...@gmail.com> wrote:
> > What isn't safe is the processing of the submit queue.
...
> > Basically I didn't write the code to support this use case, it's never
> > been tested like this, so I wouldn't suggest running it that way in
> > production right now.
>
> Good advice.
> Is this something I can take a look at to see if I can help? Looking at the
> code, it looks like ReloadSubmitQueueOp.java is the place to start,
> along with ChangeMergeQueue.java.

Yup. That's the two classes.

It might be acceptable to have ReloadSubmitQueueOp only run on a
single server, and not on the others (e.g. controlled by a configuration
flag). Or to not run it at all, and instead have an SSH command an
administrator can execute against a specific server to make that server
load and retry the submit queue after a downtime event has occurred.
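As a sketch only (this flag does not exist today; both the section and
key name are invented):

```ini
[changeMerge]
	; Hypothetical: when false, this server skips ReloadSubmitQueueOp
	; at startup, so only one designated server replays the queue.
	reloadSubmitQueue = false
```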

> Thanks, that's also good to know. I suspect read-only replicas will benefit
> from this, but ideally I'd like to have live-live Gerrit writable instances
> as well.

A lot of folks would like to have this, and that submit queue stuff is
one of the things preventing it right now. The other is keeping the
caches (reasonably) coherent between servers.
