Scaling of multi-headed repos

Gregory Szorc

Mar 10, 2014, 2:18:26 PM
to mozilla-c...@googlegroups.com
Both Mercurial and Git experience scaling issues when repos grow to
thousands of heads.

We have to "reset" (a fancy way of saying delete) the Try repo every few
months because it gets too slow after 20,000 heads or so.

The code review repositories will experience this same problem.

I've talked with other organizations having this problem and there are
two recommended solutions:

1) Have pushes round robin across several destination repos, skipping
any that are at capacity. I proposed this for the Try repo at
https://groups.google.com/d/msg/mozilla.dev.platform/Hb2EKXZmY70/G0IALqtABUkJ

2) Use bundles. Instead of pushing to a repo, have the client create a
bundle (`hg bundle` or `git bundle create`) and upload it to the server
[with metadata].

#2 scales much better because you can back your bundle store by any
key-value store (S3, etc) and you won't hit scaling limits in the
version control system or in storage. It does complicate things a bit as
any repo operation involving a changeset in a bundle will need to first
apply that bundle. This means read-only operations will require a repo
lock and writing a bundle to a repo in order to perform said operation.
That introduces its own scaling problems. And, you'll need a client
extension to perform the custom upload of the bundle. Oy.
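
To make the "read-only operations will require a repo lock" point concrete, here is a minimal toy model. It is only a sketch under stated assumptions: the class name `BundleBackedRepo`, the dict standing in for the key-value store, and the `applies` counter are all hypothetical, and no real Mercurial or Git API is involved.

```python
import threading


class BundleBackedRepo:
    """Toy model of a repo that lazily applies bundles from a key-value store."""

    def __init__(self, store):
        self.store = store            # changeset id -> bundle bytes (stand-in for S3 etc.)
        self.applied = set()          # changesets already written into the repo
        self.lock = threading.Lock()  # stand-in for the repo write lock
        self.applies = 0              # number of times a bundle had to be applied

    def read(self, changeset):
        # Even a read-only operation takes the lock, because it may have
        # to write the bundle into the repo before it can proceed.
        with self.lock:
            if changeset not in self.applied:
                if changeset not in self.store:
                    raise KeyError(changeset)
                self.applied.add(changeset)
                self.applies += 1
            # The bundle payload stands in for the result of the read.
            return self.store[changeset]
```

In this model the first read of a changeset pays for the bundle application and later reads do not, but every read still serializes on the lock, which is the scaling concern described above.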

Anyway, the point of this email was to remind people about the
many-headed scaling problem. It's likely out of scope for v1. But at
least now you know about it and can think of how resetting, round robin
load balancing, or bundles could play into the long-term, scalable solution.

Ehsan Akhgari

Mar 12, 2014, 9:33:56 PM
to Gregory Szorc, mozilla-c...@googlegroups.com
(Sorry this email was stuck in the moderation queue, I've tried to set this group to accept posts from non-subscribers but it doesn't seem to work...)

I have already asked Taras about this and he has said that we're working on things which will give us hg repos with good performance in general.

All I know about the scaling problems here concerns hg. Do you have information on similar scaling problems in git?

On the solution #1, is there any way to "hide" the location of your remotes behind a single URL which all clients will access? 

Solution #2 seems really painful to me, if I'm understanding it correctly.  This means that you won't push the changesets in the usual way, right?

Another thing which just occurred to me is that if the number of heads is the only problem (I _think_ it's _a_ problem with Mercurial?), then we could just periodically do dummy null merges to trim down the number of heads.  What do you think about that?

Cheers,


--
You received this message because you are subscribed to the Google Groups "mozilla-code-review" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mozilla-code-review+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Gregory Szorc

Mar 13, 2014, 7:04:45 PM
to Ehsan Akhgari, mozilla-c...@googlegroups.com
On 3/12/14, 6:33 PM, Ehsan Akhgari wrote:
> (Sorry this email was stuck in the moderation queue, I've tried to set
> this group to accept posts from non-subscribers but it doesn't seem to
> work...)
>
> I have already asked Taras about this and he has said that we're working
> on things which will give us hg repos with good performance in general.
>
> All I know about the scaling problems here is about hg, do you have
> information on similar scaling problems in git?

An organization I talked to experienced this multi-headed scaling
problem with both Mercurial and Git. Mozilla will likely never throw as
much hardware at the problem as this other organization. This
organization decided it was better to avoid solving the problem (if it's
even possible) and scale with bundles.

> On the solution #1, is there any way to "hide" the location of your
> remotes behind a single URL which all clients will access?

Sure. Have your ssh:// URLs round robin to several servers behind the
scenes. You can do this by installing a custom shell script that gets
executed when you SSH into the remote. You then have the remote print
the URL of the origin server it just happened to hit.
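
The selection logic behind such a front end could look roughly like the sketch below. It is an illustration only: `make_dispatcher`, the backend URLs, and the `head_counts` bookkeeping are hypothetical names, and a real deployment would need the SSH glue around it.

```python
from itertools import cycle


def make_dispatcher(backends, head_counts, max_heads):
    """Return a function that picks the next backend URL under capacity.

    backends: list of destination repo URLs (hypothetical).
    head_counts: dict mapping URL -> current number of heads.
    max_heads: per-repo capacity threshold.
    """
    ring = cycle(backends)

    def next_backend():
        # Visit each backend at most once per call, in round-robin order.
        for _ in range(len(backends)):
            url = next(ring)
            if head_counts.get(url, 0) < max_heads:
                return url
        raise RuntimeError("all destination repos are at capacity")

    return next_backend
```

The script run on SSH login would call `next_backend()` and print the returned URL, so clients only ever need to know the single front-end address.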

> Solution #2 seems really painful to me, if I'm understanding it
> correctly. This means that you won't push the changesets in the usual
> way, right?

Correct. An extension or custom script would likely be required to avoid
typing curl commands.

> Another thing which just occurred to me is that if the number of heads
> is the only problem (I _think_ it's _a_ problem with Mercurial?), then
> we could just periodically do dummy null merges to trim down the number
> of heads. What do you think about that?

A possibility. But if we're talking about preserving the code review
repos for all of time, we're eventually talking repos with millions of
commits and tens of millions of file revisions. I think that's pushing a
limit (in both Mercurial and Git) that we should avoid if we have the
opportunity.
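
A small DAG model illustrates the trade-off. This is a sketch, not Mercurial's data model: the graph is a plain dict from node to parents, and the `merge0`, `merge1`, ... names are made up for the example.

```python
def heads(parents):
    """Return the nodes that are not a parent of any other node."""
    all_parents = {p for ps in parents.values() for p in ps}
    return set(parents) - all_parents


def merge_heads(parents):
    """Add two-parent merge nodes until a single head remains.

    Mutates `parents` and returns the number of merge nodes added.
    """
    merges = 0
    current = sorted(heads(parents))
    while len(current) > 1:
        a, b = current[0], current[1]
        name = f"merge{merges}"      # hypothetical node name
        parents[name] = (a, b)
        merges += 1
        current = [name] + current[2:]
    return merges
```

Collapsing N heads this way takes N - 1 merge nodes: the head count drops to one, but the total number of changesets in the repo only grows.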

On the topic of preserving these repos for all of time: should that be a
goal? We obviously don't want to lose review comments. If ReviewBoard is
pulling diffs from repos instead of caching them in its database, that
could be problematic for resetting repos. I'm not sure exactly how
ReviewBoard 2.0 works in this regard.

Another benefit of bundles just occurred to me: authentication. Our SSH
key management doesn't (currently) scale. It's much easier to post a
bundle through a Persona (or similar) authenticated web service than to
establish an SSH public key at Mozilla. Hopefully we make SSH key
management self-service some day...

George Miroshnykov

Mar 14, 2014, 7:14:55 AM
to Ehsan Akhgari, Gregory Szorc, mozilla-c...@googlegroups.com

On March 14, 2014 at 1:04:47 AM, Gregory Szorc (g...@mozilla.com) wrote:

> On 3/12/14, 6:33 PM, Ehsan Akhgari wrote:
>> (Sorry this email was stuck in the moderation queue, I've tried to set
>> this group to accept posts from non-subscribers but it doesn't seem to
>> work...)
>>
>> I have already asked Taras about this and he has said that we're working
>> on things which will give us hg repos with good performance in general.
>>
>> All I know about the scaling problems here is about hg, do you have
>> information on similar scaling problems in git?
>
> An organization I talked to experienced this multi-headed scaling
> problem with both Mercurial and Git. Mozilla will likely never throw as
> much hardware at the problem as this other organization. This
> organization decided it was better to avoid solving the problem (if it's
> even possible) and scale with bundles

I can’t say anything about Mercurial, but having this problem with Git is really weird.
Git’s heads are basically branches: they are created and removed during the project’s lifetime.
Unless you have thousands of active branches, you should be fine with Git.
As soon as a branch is removed, its head is gone and should not affect performance at all.
Here’s a nice writeup: http://git-scm.com/book/en/Git-Internals-Git-References

AFAIK Review Board doesn’t cache those diffs at the moment.
But at the same time, RB doesn’t care about the concept of a “head”; all it needs is the changeset.
Again, I’m not sure how relevant this is for Mercurial, but just having the changeset somewhere in the DAG should be enough for RB to work. We don’t need a separate head for each of those changesets.
Maybe we can “garbage collect” the heads in some way that leaves the changesets in the DAG?


> Another benefit of bundles just occurred to me: authentication. Our SSH
> key management doesn't (currently) scale. It's much easier to post a
> bundle through a Persona (or similar) authenticated web service than to
> establish an SSH public key at Mozilla. Hopefully we make SSH key
> management self-service some day...

If we use SSH keys for push access anyway, I think it’s more beneficial in the long term to figure out the key management.

Thanks,
George

Gregory Szorc

Mar 15, 2014, 6:47:50 PM
to George Miroshnykov, Ehsan Akhgari, mozilla-c...@googlegroups.com
On 3/14/14, 4:14 AM, George Miroshnykov wrote:
>
> On March 14, 2014 at 1:04:47 AM, Gregory Szorc (g...@mozilla.com) wrote:
>
>> On 3/12/14, 6:33 PM, Ehsan Akhgari wrote:
>> > (Sorry this email was stuck in the moderation queue, I've tried to set
>> > this group to accept posts from non-subscribers but it doesn't seem to
>> > work...)
>> >
>> > I have already asked Taras about this and he has said that we're working
>> > on things which will give us hg repos with good performance in general.
>> >
>> > All I know about the scaling problems here is about hg, do you have
>> > information on similar scaling problems in git?
>>
>> An organization I talked to experienced this multi-headed scaling
>> problem with both Mercurial and Git. Mozilla will likely never throw as
>> much hardware at the problem as this other organization. This
>> organization decided it was better to avoid solving the problem (if it's
>> even possible) and scale with bundles
>
> I can’t say anything about Mercurial, but having this problem with Git
> is really weird.
> Git’s heads are basically branches - they are created and removed during
> project lifetime.
> Unless you have thousands of active branches, you should be fine with Git.
> As soon as the branch is removed, its head is gone and should not
> affect performance at all.
> Here’s a nice writeup:
> http://git-scm.com/book/en/Git-Internals-Git-References

Right. And presumably we want all the commits persisting so we can go
back and look at the review, interdiffs, etc. That means keeping a ref
to the head/commit so it won't be garbage collected. Or you can disable
garbage collection, which might circumvent the scaling issue to some extent.
That works for Git since Git has garbage collection. Mercurial doesn't
have explicit garbage collection. Yes, you can use `hg strip` to remove
changesets. Yes, you can clone a subset of a repo to effectively delete
changesets. But the data principle of Mercurial follows the Hotel
California rule: you can check out but you can never leave. Revlogs are
append only and immutable. Once you commit a changeset, that changeset
lives forever. (Or at least that's the intent - things like mq and strip
subvert this property.)
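
The reason a ref has to be kept per review can be shown with a simplified reachability model. This is a sketch under assumptions, not Git's implementation: it ignores reflogs and pruning grace periods, and the ref names such as `review/1` are hypothetical.

```python
def reachable(parents, refs):
    """Return every commit reachable from the given ref targets."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def collectable(parents, refs):
    """Commits that no ref can reach, i.e. candidates for garbage collection."""
    return set(parents) - reachable(parents, refs)
```

In this model, deleting the ref for a review makes its commit collectable unless something else still points at it, so the number of refs grows with the number of reviews that must be preserved.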

Taras Glek

Mar 17, 2014, 6:23:06 PM
to Gregory Szorc, George Miroshnykov, Ehsan Akhgari, mozilla-c...@googlegroups.com
FYI,
I don't think this is a hard problem. We'll have trees persist forever. Whether we do that by having rolling trees, storing trees in S3, using bundles, or some combination of the three is an implementation detail. We'll get there; let's focus on getting a smooth review flow for now. I'll make sure the right things(tm) happen on the backend.

Taras