gitolite in a cluster...?


"ken1 (Kenneth Ölwing)"

Nov 5, 2012, 10:46:31 AM
to gito...@googlegroups.com
Hi,

I'm considering setting up a cluster of hosts running gitolite, e.g. A,
B, C. The hosts effectively would be identical, and will serve the same
repos (including gitolite-admin) from an NFS mount only reachable from
these hosts and symlinked from the ~gitoliteuser/repositories on each
host. In front of A/B/C, I'll put a simple loadbalancer LB that will
distribute calls onto the available instances.

On the pure git level, this shouldn't pose any problems as, AFAIK, git
makes sure to serialize access to a given repo; no different from the
case where John and Jane use a file:// URL to access a repo on a shared
NFS location.

However, I'm unsure of how the multiple gitolite instances would
react... the gitolite-admin repo is common to the three, but I need to
share something more between the hosts to make them act as one. Would
~gitoliteuser/.gitolite be workable as a symlink too (similar to
repositories above)? Or just part of it (conf/keydir...?)? Anything
else? Does gitolite today incorporate some form of locking to make sure
these things are maintained serially? Any alternative strategies that
come to mind?

Perhaps this idea is dead in the water for some very simple/known
reason... or is there some existing knowledge of this (some Google
search variations don't seem to turn up much that's useful)? It seems
like the bigger players using gitolite at heart would have this type of
setup going...

I'll play away with this a bit; it might even generate some patches to
handle this if I get that far :-)

TIA,

ken1

Sitaram Chamarty

Nov 5, 2012, 11:22:28 AM
to ken1 (Kenneth Ölwing), gitolite
On Mon, Nov 5, 2012 at 9:16 PM, "ken1 (Kenneth Ölwing)"
<kenneth...@klarna.com> wrote:
> Hi,
>
> I'm considering setting up a cluster of hosts running gitolite, e.g. A, B,
> C. The hosts effectively would be identical, and will serve the same repos
> (including gitolite-admin) from an NFS mount only reachable from these hosts
> and symlinked from the ~gitoliteuser/repositories on each host. In front of
> A/B/C, I'll put a simple loadbalancer LB that will distribute calls onto the
> available instances.

Why? The main load in any git setup is IO (and possibly some CPU for
the clones when you have a large repo while they create the pack file
to be xfrd). How is an NFS setup going to help you in this?

Wouldn't you be happier using gitolite *mirroring*, which (as a bonus)
also gives you a couple of warm standby servers in case "A" dies?

> On the pure git level, this shouldn't pose any problems as, AFAIK, git makes
> sure to serialize access to a given repo; no different from the case where
> John and Jane uses a file:// URL to access a repo on a shared NFS location.
>
> However, I'm unsure of how the multiple gitolite instances would react...the
> gitolite-admin repo is common for the three, but I need to also share
> something more between the hosts to make them. Would ~gitoliteuser/.gitolite
> be workable as a symlink too (similar to repositories) above. Or just a part
> of it (conf/keydir...?)? Anything else? Does gitolite today incorporate some
> form of locking to make sure these things are kept serially maintained? Any
> alternative strategies that comes to mind?

Gitolite does not do any locking. As long as you push to only one
server at a time, things ought to work fine.

> Perhaps this idea dead in the water for some very simple/known reason...or

For me, it's dead because I don't see the benefit. Your CPU must be
really weak (and your IO must be really fast) for this to be useful.
But I could be wrong... maybe someone else will chip in with more
authority on the performance aspect.

> is there some existing knowledge of this (some google search variations
> doesn't seem to turn up much useful); seems like the bigger players using
> gitolite at heart would have this type of setup going...

All the big players I know use proper mirroring.

>
> I'll play away with this a bit; it might even generate some patches to
> handle this if I get that far :-)

Please discuss [right here on the list] before sending patches :)


--
Sitaram

"ken1 (Kenneth Ölwing)"

Nov 6, 2012, 4:43:04 AM
to Sitaram Chamarty, gito...@googlegroups.com
Hi,


On 2012-11-05 17:22, Sitaram Chamarty wrote:
> Why? The main load in any git setup is IO (and possibly some CPU for
> the clones when you have a large repo while they create the pack file
> to be xfrd). How is an NFS setup going to help you in this?
>
> Wouldnt you be happier using gitolite *mirroring*, which (as a bonus)
> also gives you a couple of warm standby servers in case "A" dies?

I consider clustering and mirroring to be two different things. Use cases differ, but there is certainly some overlap. One thing I'm not concerned with at the moment is geographically dispersed sites; in that case mirroring would make much sense, i.e. having slaves that are local/close to each site. Now, extreme performance isn't my main goal; it's more the manageability and scalability features I'm after.

To quote from the docs about mirroring:

> Mirroring is simple: you have one "master" server and one or more
> "slave" servers. The slaves get updates only from the master; to the
> rest of the world they are at best read-only.

With clustering, there is no master/slave distinction - they are all transparently equal. All are available for both clone and push. What you cite as a bonus with mirroring regarding warm standbys... well, with clustering, they're all *hot*. Have one crash/die, or even just bring one down for maintenance, and no-one's the wiser; at worst some nudging of the loadbalancer is needed, but a decent one will realize a given host is unresponsive and will drop it from the list until told otherwise. It's eminently scalable - add as many hosts as you like, tell the loadbalancer, and you're good to go.

Exploring performance concerns:

As you say, CPU is seldom the problem, so we can discount that in general. But I'm operating in a virtualization environment, and those generally work much better with many small hosts with few cores (even just one) per host rather than a few hosts with many cores. This allows the hypervisor better granularity.

To get the behavior I envision it's a necessity that all servers see the same filesystem - hence NFS. Now, I'm talking about a good NAS to serve this, not a regular nfsd on Linux; performance-wise they are typically miles apart, for many reasons - they're purpose-built for the task, after all. For the same reason they will also scale better under higher/concurrent loads. In fact, I've measured very close (and occasionally *better*) IO performance between a NAS and local disk (at a prior company). A NAS also answers issues around single points of failure (assume one with redundant heads and network connections) and possibly backups (snapshot capability makes things a breeze).


> Gitolite does not do any locking. As long as you push to only one
> server at a time, things ought to work fine.

Yes. So what happens if I push to more than one server at a time? :-)

Actually, though, I'm also wondering what will happen if I push a second time to the same server while the previous push is *still being processed* by gitolite. I'm lacking a lot of understanding here, both on gitolite and git...

So assume this scenario:

Client A pushes a change to gitolite-admin. This triggers a rebuild of the conf file (in a post-update hook, IIUC). In my case, this runs for ~10 minutes.

Now, client B comes along, pulls the latest changes, and then immediately pushes its own change... first of all, will this pull/push be accepted since the previous post-update is still running (I assume it will, precisely because it's a post-update)? If it is, a new post-update is started. Will they conflict? For the sake of argument, assume that the last change makes the conf file really easy to process, so it finishes in mere seconds - as the other post-update is in the middle of things, I doubt it'll realize this, so when it eventually finishes it blithely writes its own version of the compiled conf. At least this would be my worry, assuming they don't somehow synchronize and ensure things happen serially.

Can this happen? If not, why? If it doesn't...well, then multiple hosts would work as well, no?

So, in the clustering scenario I paint, a gitolite instance is running on each processing host. They all see the same 'repositories' filesystem, including the gitolite-admin repo. And to work as one, they need to also have the same data to operate from, probably using the same shared filesystem, so this would require coordination/synchronization between them. If you can give me some hints on what you might think would be necessary to share/synchronize that would help.

An altogether different approach might be to cheat a bit and make it necessary to always push to a certain host for gitolite changes, effectively requiring all others to just have a read-only view of resulting configuration. Could work I guess, but less palatable.

Hmm. Well, it should be quite possible to test in general by some research. I'll look into it.


> For me, it's dead because I don't see the benefit. Your CPU must be
> really weak (and your IO must be really fast) for this to be useful.
> But I could be wrong... maybe someone else will chip in with more
> authority on the performance aspect.

Maybe my arguments on benefits and CPU/IO performance issues show that there is some merit to this train of thought? But I'd love to get further insight on these topics, as I know there's a lot I don't know, and probably a lot I've misunderstood :-)


> Please discuss [right here on the list] before sending patches :)

Absolutely. Thanks for your reply.

ken1

Sitaram Chamarty

Nov 6, 2012, 10:33:20 AM
to ken1 (Kenneth Ölwing), gitolite
(I snipped large parts of the text to make it easier to read my reply; sorry)

On Tue, Nov 6, 2012 at 3:13 PM, "ken1 (Kenneth Ölwing)"
<kenneth...@klarna.com> wrote:
> Hi,
>
>
> On 2012-11-05 17:22, Sitaram Chamarty wrote:

> I consider clustering and mirroring to be two different things. Usecases

Your NAS disk is a single point of failure, but you get seamless
switchover. With a mirroring setup I have no single point of failure,
but switchover is usually manual.

Debate on which is better will probably be subjective and colored by
our respective experiences and past engineering decisions. Let's
leave it at that.

> But I'm acting in a virtualization environment and they generally work much
> better with many small hosts with few cores (even just one) assigned per
> host rather than a few with many cores. This allows the hypervisor better
> granularity.

Fair enough.

> Gitolite does not do any locking. As long as you push to only one server at
> a time, things ought to work fine.

[Uggh... Quoting got screwed up somehow; those lines should have two >
signs in front]

> Actually though, I'm also wondering what will happen if I push a second time
> to the same server, while the previous push is *still processed* by
> gitolite. I'm lacking a lot of understanding here, both on gitolite and
> git...

Git itself won't be affected at all. Since it is a content-addressed
system, any object can only be overwritten by *itself* if at all. So
it'll all work fine.
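That content-addressed argument can be illustrated directly (a sketch; assumes git is on the PATH, and the content shown is arbitrary):

```shell
# An object's name is a hash of its content, so two writers storing the
# same content write the same bytes to the same path under .git/objects;
# a racing "overwrite" is therefore harmless.
a=$(printf 'same content\n' | git hash-object --stdin)
b=$(printf 'same content\n' | git hash-object --stdin)
c=$(printf 'other content\n' | git hash-object --stdin)

[ "$a" = "$b" ] && echo "identical content, identical object ID"
[ "$a" != "$c" ] && echo "different content, different object ID"
```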

Gitolite also won't care or be affected, as long as you're doing this
to normal repos.

But when you push to the special 'gitolite-admin' repo, there is a
short window, while the new "compiled" config file is being written
out, during which something nasty *may* occur.

However, my best attempts at making something nasty happen have failed
so far. I tried with about 6000 repositories, at which point a
'gitolite compile' takes about 6 seconds (on my laptop). I then ran
multiple 'gitolite compile' commands, as many as 4 overlapping runs.
The end result (the compiled file) was still produced correctly.

Ralf Hemmecke

Nov 6, 2012, 10:50:39 AM
to Sitaram Chamarty, gito...@googlegroups.com
> Git itself won't be affected at all. Since it is a content-addressed
> system, any object can only be overwritten by *itself* if at all.

Not completely true. Mathematically the content space is infinite, but
the address space (although 40 hex digits is rather big) is finite. So
there are surely two files that map to the same address. ;-)

But we all believe that this will not happen in the lifetime of any
human in the next 100 (or even more) years.

Ralf

Sitaram Chamarty

Nov 6, 2012, 11:01:11 AM
to Ralf Hemmecke, gitolite
We were talking about race conditions due to simultaneously pushing.
Actual content duplicates, as unlikely as they are, are even more
unlikely to trigger within 2 racing pushes.

--
Sitaram

"ken1 (Kenneth Ölwing)"

Nov 7, 2012, 10:14:23 AM
to gito...@googlegroups.com
On 2012-11-06 16:33, Sitaram Chamarty wrote:
> (I snipped large parts of the text to make it easier to read my reply; sorry)
As do I...:-)
> Git itself won't be affected at all. Since it is a content-addressed
> system, any object can only be overwritten by *itself* if at all. So
> it'll all work fine.
Yes, that makes sense; although I'd guess finer points arise if things
like git gc/repack etc. enter the equation, as they muck around under
objects. But I'd expect that git handles possible concurrency there.

> But when you push to the special 'gitolite-admin' repo, there is a
> short period during which something nasty *may* occur during the time
> that the new "compiled" config file is being written out.
>
> However, my best attempts at making something nasty happen have failed
> so far. I tried with about 6000 repositories, at which point a
> 'gitolite compile' takes about 6 seconds (on my laptop). I then ran
> multiple 'gitolite compile' commands, as many as 4 overlapping runs.
> The end result (the compiled file) was still produced correctly.
As I described, with compile times of 10 minutes, the 'may' might be a
bit more likely to occur. But that's with v2, so maybe things have
improved; I just don't know yet. Anyway, I'll do some (unfair :-) tests
to satisfy my curiosity, if nothing else... just as soon as I recover
from a busted local disk, with several Linux VMs in a questionable
state. Sigh... I'll get back if I have anything further to report.

Thanks for your time,

ken1


Sitaram Chamarty

Nov 7, 2012, 10:15:33 PM
to ken1 (Kenneth Ölwing), gito...@googlegroups.com
On Wed, Nov 7, 2012 at 8:44 PM, "ken1 (Kenneth Ölwing)"
<kenneth...@klarna.com> wrote:
> On 2012-11-06 16:33, Sitaram Chamarty wrote:

>> Git itself won't be affected at all. Since it is a content-addressed
>> system, any object can only be overwritten by *itself* if at all. So
>> it'll all work fine.
>
> Yes, that makes sense; although I'd guess finer points occur if things like
> git gc/repack etc enter the equation, as they muck around under objects. But
> I'd expect that git handles possible concurrency for that.

No. The name of the packfile is itself a SHA, and in turn uniquely
describes the contents. Same logic applies.
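That "same logic" can be checked with a throwaway repo (a sketch; assumes git is installed, and a repo small enough that packing is deterministic):

```shell
# Repacking the same set of objects yields a pack whose filename is
# derived from its contents, so a second repack just reproduces the
# same pack under the same name.
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
echo hello > f && git add f
git -c user.name=t -c user.email=t@example.com commit -qm init

git repack -adq
first=$(basename .git/objects/pack/*.pack)
git repack -adq
second=$(basename .git/objects/pack/*.pack)

[ "$first" = "$second" ] && echo "same objects, same pack name"
```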

The only time you *may* have trouble, if I recall, is on some kinds of
non-Unix file systems like CIFS. I'll have to dig up the link; sorry
don't have it handy.

>> However, my best attempts at making something nasty happen have failed
>> so far. I tried with about 6000 repositories, at which point a
>> 'gitolite compile' takes about 6 seconds (on my laptop). I then ran

That 6 seconds is wrong, sorry about that. The critical part is 0.3
seconds for 11,000 (yes, eleven thousand) repos. See below for
details.

>> multiple 'gitolite compile' commands, as many as 4 overlapping runs.
>> The end result (the compiled file) was still produced correctly.
>
> As I described, with compile times of 10 minutes, the 'may' might be a bit

I seem to recall you have a few hundred repos, so I don't understand
why this is taking 10 minutes. The worst case timing for 500 repos on
my laptop is 12 seconds on v3. (Even if I have to *create* the repos
it's about 2 minutes, but that's a one-time cost so I am not counting
it.)

If you're using v2 but without GL_BIG_CONFIG, please don't even bother
replying. I didn't spend time coding it for people to ignore it and
go into hypothetical situations that I don't want to deal with.

----------

Details on time taken by 'git push' on the admin repo, with a conf
containing 11,000 repos. All timings are on my lenovo X201 laptop.

Let's get these things out of the way first:

(1) This is all for v3. The total timings for v2 with GL_BIG_CONFIG
should be similar, although it is not easy to determine the breakdown
because it was not modular enough.

(2) When you add a new repo, some extra things happen. You won't
notice unless you add a few hundred repos in one shot. I'm going to
assume we're talking about a push where the number of repos is not
significantly changed but perhaps some users were added etc., or their
access was changed, etc.

(3) The number of users is irrelevant for timing purposes on the push.
Your ~/.ssh/authorized_keys may become so big that sshd takes time to
log you in, but that's not *inside* gitolite. (My laptop adds about 1
second per 2500-3000 lines in the authkeys file, so don't worry about
it unless you have more than a thousand or so users).

Now for the timings...

(1) a 'git push' on the admin repo causes two things to happen:

(1.1) gitolite compile

This parses the config, converts it into a bunch of perl hashes, and
writes them to files. There is one common "gitolite.conf-compiled.pm"
in ~/.gitolite/conf which contains everything that is not specific to
an actual repo, and then each actual, named, repo, has a "gl-conf"
file in it. (See http://sitaramc.github.com/gitolite/g2/bc.html for
some details; although it's a v2 document the basic idea is the same
in v3).

So there's one common file, and 11,000 repo-specific files to write.

Parse: 30 seconds (cold cache time. If you repeat it this goes
down to 7 seconds).
Write 11,000 files: 5 to 7 seconds.
Write the common file: 0.2 to 0.3 seconds.

Potential race conditions are certainly possible in theory, and yes
it's trivial to use a lock file at the top of the post-update hook
code if you need to, but that's not the point here. If you're firing
off admin pushes at a rate that makes it likely you will hit two of
them within the same 0.2 second slot, you have something seriously
wrong in your setup or your understanding of gitolite.
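For completeness, one way such a lock could look (a sketch, not part of gitolite; the lock path and hook body are illustrative, and mkdir is used because flock(1) is generally unreliable over NFS):

```shell
#!/bin/sh
# Hypothetical guard at the top of the gitolite-admin post-update hook:
# serialize compiles across overlapping pushes (and across cluster
# hosts, since the lock lives on the shared filesystem). mkdir either
# creates the directory or fails, atomically, even on NFS.
LOCKDIR="$HOME/.gitolite/compile.lock"

until mkdir "$LOCKDIR" 2>/dev/null; do
    sleep 1            # another compile is in flight; wait our turn
done
trap 'rmdir "$LOCKDIR"' EXIT

# ... the real hook's work would run here, e.g.:
# gitolite compile && gitolite trigger POST_COMPILE
echo "compiled under lock"
```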

(1.2) gitolite trigger POST_COMPILE

This does all the non-core stuff like setting up permissions for
gitweb and git-daemon and acting upon 'config' lines to turn them into
'git config ...' for each repo as needed.

This takes a *lot* of time for 11,000 repos:

update-git-configs: about 7 minutes
update gitweb access: 29 minutes
update git-daemon access: 21 minutes

However, if you're not interested in gitweb/daemon, you can remove
those lines from the POST_COMPILE list (as well as the POST_CREATE
list) in the rc file. Poof; all gone.

"ken1 (Kenneth Ölwing)"

Nov 8, 2012, 5:56:45 AM
to Sitaram Chamarty, gito...@googlegroups.com
Hi,

Thanks for the excellent depth and the performance walkthrough. On the
subject of the packfile, yet another aha moment - it should have been
obvious. And no, using CIFS underneath isn't really relevant, so don't
bother trying to find old data about it...

***
Just some notes to hopefully clarify....

Yes, our server is v2 without GL_BIG_CONFIG, set up long before my time.
I'm not really 'ignoring' that - in fact, my intention is quite the
opposite: to switch to v3 (for multiple reasons, this being one). Any
issues, especially this one, are as such irrelevant, as it works and I
don't seek any fixes for it or any explanation of why it happens. I'm
not really planning on changing it either, for two reasons: 1) it's
automatically moot if I go to v3, and 2) as long as things work I don't
want (or need) to rock the boat.

So we can leave the current server, it has really nothing to do with
what I'm researching. I only dragged it in for the discussion of
possible race-conditions which was probably wrong as v3 is the important
goal; sorry.
***

So, in case you don't mind still exploring a use case for gitolite
(which, admittedly, is hypothetical at this point and not something you
designed it for), I'll try to walk through what I've learned so far...

I'm trying to explore what would happen when 'n' instances of gitolite
would see a shared filesystem with repositories so that a client could
randomly connect to any of them and get the same results. Whether this
is good/bad or even worthwhile is surely debatable. I have made some
arguments in favor, but it may turn out to be a silly idea, or even a
really bad one - this is really what I'm trying to figure out.

Assuming there *is* any merit to the above, what are the possible
issues? If any of the below is wildly incorrect please feel free to chop
my head off ;-).

1) A requirement is that the gitolite instances are all effectively
identical. Assuming they are to start with, it should all work
splendidly with all regular repos, both reads and writes, thanks to how
Git itself works and the fact that gitolite is essentially just a
conduit to it.

2) A key problem is obviously the requirement of 'effectively
identical'. Since a push to the admin repo results in changes to the
configuration, this means that these changes need to be visible to all
instances.

3) Ignoring any concurrency issues to begin with, it seems a simple
approach is to make the git hosting user's home directory shared too.
Would this suffice, or are there other places on a given host that
gitolite would update? Any other real showstoppers that come to mind?
Perhaps a better approach would be to symlink just the relevant things;
'repositories', '.gitolite' and '.ssh/authorized_keys' come to mind.
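The symlink variant might look something like this on each host (illustrative only; /shared/gitolite is a hypothetical NFS mount point, and the exact set of links is the open question above):

```shell
# Hypothetical per-host setup: link the state gitolite reads and writes
# from a shared NFS mount into the hosting user's $HOME.
SHARED=/shared/gitolite           # assumed NFS mount, same on every host

cd "$HOME"
ln -sfn "$SHARED/repositories"    repositories
ln -sfn "$SHARED/dot-gitolite"    .gitolite
mkdir -p .ssh && chmod 700 .ssh
ln -sfn "$SHARED/authorized_keys" .ssh/authorized_keys
```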

4) Assuming the above seems workable, we can consider possible
concurrency. Your research indicates that it's simply very unlikely to
be a problem; your tests were much more aggressive than any (normal)
admin repo update is likely to be, so...

5) OK, so to address even some of this, it is, as you say, trivial to
implement a lock file during post-update, just to ensure that any such
event is always serialized, pushing the likelihood of unintended
clashes even lower.

Well, in case you're still reading, any further considerations you can
provide are surely of interest.

ken1

Sitaram Chamarty

Nov 8, 2012, 6:44:52 AM
to ken1 (Kenneth Ölwing), gito...@googlegroups.com
On Thu, Nov 8, 2012 at 4:26 PM, "ken1 (Kenneth Ölwing)"
<kenneth...@klarna.com> wrote:

> I'm trying to explore what would happen when 'n' instances of gitolite would
> see a shared filesystem with repositories so that a client could randomly
> connect to any of them and get the same results. Whether this is good/bad or
> even worthwhile is surely debatable. I have made some arguments in favor,
> but it may turn out to be a silly idea, or even a really bad one - this is
> really what I'm trying to figure out.
>
> Assuming there *is* any merit to the above, what are the possible issues? If

None.

> 1) A requirement is that the gitolite instances all are effectively
> identical.

isn't that what "shared filesystem" means?

> 2) A key problem is obviously the requirement of 'effectively identical'.
> Since a push to the admin repo results in changes to the configuration, this
> means that these changes need to be visible to all instances.

again, isn't that what "shared filesystem" means?

> 3) Ignoring any concurrency issue to begin with, it seems a simple approach
> is to make the git hosting users home-directory shared too. Would this

Umm, I assumed that.

I'm not getting your question. You started out with a certain
assumption regarding the setup ("shared filesystem") but seem to be
questioning that assumption in later points.

> suffice or are there other places on a given host that gitolite would
> update?

Now that's a real question. Here's a full list for a default install
(all in $HOME):

.ssh
.gitolite
.gitolite.rc
repositories

Everything else is unrelated to gitolite itself. For example the
"projects.list" file is also created by default but you can place it
anywhere else you like by setting GITWEB_PROJECTS_LIST (or some such
variable) in the rc file.

"ken1 (Kenneth Ölwing)"

Nov 8, 2012, 7:20:14 AM
to Sitaram Chamarty, gito...@googlegroups.com
>> 3) Ignoring any concurrency issue to begin with, it seems a simple approach
>> is to make the git hosting users home-directory shared too. Would this
> Umm, I assumed that.
>
> I'm not getting your question. You started out with a certain
> assumption regarding the setup ("shared filesystem") but seem to be
> questioning that assumption in later points.
Aha; yes, I see that I was unclear - I only considered a shared
filesystem for 'repositories' to start with (symlinked into $HOME on
each host), and then moved on to sharing the *entire* homedir. It was
mostly the way I reasoned my way through the list I made, I guess.

>
>> suffice or are there other places on a given host that gitolite would
>> update?
> Now that's a real question. Here's a full list for a default install
> (all in $HOME):
>
> .ssh
> .gitolite
> .gitolite.rc
> repositories
>
> Everything else is unrelated to gitolite itself. For example the
> "projects.list" file is also created by default but you can place it
> anywhere else you like by setting GITWEB_PROJECTS_LIST (or some such
> variable) in the rc file.
Ok, good. So we're constrained to $HOME only. I'm not too concerned
about gitweb, as it's likely we can't have that running (I'd need to
set up another, fully authenticated solution for browsing repos that
way - we're constrained by external rules to keep our source accessible
only in an authenticated manner).

I lastly mused on whether it'd be useful to keep $HOME separate on each
host and symlink in just the above, but that's probably overthinking it.

Sorry if I sound confused sometimes but I truly value your feedback. It
still sounds like it could work from what I've heard so far.

Have a good one!

ken1
