I would like to (re-)open a discussion on the following specific question:
Assuming we are moving the llvm project to git, should we
a) use multiple git repositories, linked together as subrepositories
of an umbrella repo, or
b) use a single git repository for most llvm subprojects.
The current proposal assembled by Renato follows option (a), but I
think option (b) will be significantly simpler and more effective.
Moreover, I think the issues raised with option (b) are either
incorrect or can be reasonably addressed.
Specifically, my proposal is that all LLVM subprojects that are
"version-locked" (and/or use the common CMake build system) live in a
single git repository. That probably means all of the main llvm
subprojects other than the test-suite and maybe libc++. From looking
at the repository today that would be: llvm, clang, clang-tools-extra,
lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
Let's first talk about the advantages of a single repository. Then
we'll address the disadvantages raised.
At a high level, one repository is simpler than multiple repos that
must be kept in sync using an external mechanism. The submodules
solution requires nontrivial automation to maintain the history of
commits in the umbrella repo (which we need if we want to bisect, or
even just build an old revision of clang), but no such mechanisms are
required if we have a single repo.
Similarly, it's possible to make atomic API changes across subprojects
in a single repo; we simply can't do that with the submodules proposal.
And working with llvm release branches becomes much simpler.
In addition, the single repository approach ties branches that contain
changes to subprojects (e.g. clang) to a specific version of llvm
proper. This means that when you switch between two branches that
contain changes to clang, you'll automatically check out the right
llvm bits.
Although we can do this with submodules too, a single repository makes
it much easier.
As a concrete example, suppose you are working on some changes in
clang. You want to commit the changes, then switch to a new branch
based on tip of head and make some new changes. Finally you want to
switch back to your original branch. And when you switch between
branches, you want to get an llvm that's in sync with the clang in
your working copy.
Here's how I'd do it with a monolithic git repository, option (b):
git commit # old-branch
git fetch
git checkout -b new-branch origin/master
# hack hack hack
git commit # new-branch
git checkout old-branch
Here's how I'd do it with option (a), submodules. I've used git -C
here to make it explicit which repo we're working in, but in real life
I'd probably use cd.
# First, commit to two branches, one in your clang repo and one in your
# master repo.
git -C tools/clang commit # old-branch, clang submodule
git commit # old-branch, master repo
# Now fetch the submodule and check out head. Start a new branch in the
# umbrella repo.
git submodule foreach git fetch
git checkout -b new-branch origin/master
git submodule update
# Start a new branch in the clang repo pointing to the current head.
git -C tools/clang checkout -b new-branch
# hack hack hack
# Commit both branches.
git -C tools/clang commit # new-branch
git commit # new-branch
# Check out the old branch.
git checkout old-branch
git submodule update
This is twice as many git commands, and almost three times as much
typing, to do the same thing.
Indeed, this is so complicated I expect that many developers wouldn't
bother, and will continue to develop the way we currently do. They
would thus continue to be unable to create clang branches that include
an llvm revision. :(
There are real simplifications and productivity advantages to be had
by using a single repository. They will affect essentially every
developer who makes changes to subprojects other than LLVM proper,
cares about release branches, bisects our code, or builds old
revisions.
So that's the first part, what we have to gain by using a monolithic
repository. Let's address the downsides.
If you'll bear with a hypothetical: Imagine you could somehow make the
monolithic repository behave exactly like the N separate repositories
work today. If so, that would be the best of both worlds: Those of us
who want a monolithic repository could have one, and those of us who
don't would be unaffected. Whatever downsides you were worried about
would evaporate in a mist of rainbows and puppies.
It turns out this hypothetical is very close to reality. The key is
git sparse checkouts [1], which let you check out only some files or
directories from a repository. Using this facility, if you don't like
the switch to a monolithic repository, you can set up your git so
you're (almost) entirely unaffected by it.
If you want to check out only llvm and clang, no problem. Just set up
your .git/info/sparse-checkout file appropriately. Done.
If you want to be able to have two different revisions of llvm and
clang checked out at once (maybe you want to update your clang bits
more often than you update your llvm bits), you can do that too. Make
one sparse checkout just of llvm, and make another sparse checkout
just of clang. Symlink the clang checkout to llvm/tools/clang.
That's it. The two checkouts can even share a common .git dir, so you
don't have to fetch and store everything twice.
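The two-checkout setup described above can be sketched roughly as follows. This is only an illustration on a throwaway repository: the directory names (`mono`, `clang-only`) and file contents are made up, and using a linked worktree to share the .git dir is one assumed way of doing it, not necessarily the only one. It relies on `git worktree add --no-checkout`, which needs a reasonably recent git.

```shell
set -eu

# Throwaway stand-in for the monolithic repository (hypothetical contents).
demo=$(mktemp -d)
git init -q "$demo/mono"
cd "$demo/mono"
mkdir -p llvm/tools clang
echo "llvm bits" > llvm/README
echo "placeholder" > llvm/tools/CMakeLists.txt
echo "clang bits" > clang/README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "import llvm and clang"

# First checkout: restrict the main working tree to llvm/.
git config core.sparseCheckout true
echo "/llvm/" > .git/info/sparse-checkout
git read-tree -mu HEAD

# Second checkout: a linked worktree that shares the same .git dir and is
# restricted to clang/. --no-checkout leaves it empty until the sparse
# pattern is in place.
git worktree add --no-checkout --detach "$demo/clang-only" HEAD
cd "$demo/clang-only"
pattern_file=$(git rev-parse --git-path info/sparse-checkout)
mkdir -p "$(dirname "$pattern_file")"
echo "/clang/" > "$pattern_file"
git read-tree -mu HEAD

# Symlink the clang checkout to llvm/tools/clang.
ln -s "$demo/clang-only/clang" "$demo/mono/llvm/tools/clang"
ls "$demo/mono" "$demo/clang-only"
```

Each working tree can then be moved to a different revision independently, while objects are fetched and stored only once.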
As far as I can tell, the only overhead of the monolithic repository
is the extra storage in .git. But this is quite small in the scheme
of things.
The .git dir for the existing monolithic repository [2] is 1.2GB. By
way of comparison, my objdir for a release build of llvm and clang is
3.5G, and a full checkout (workdir + .git dirs) of llvm and clang is
0.65G.
If the 1.2G really is a problem for you (or more likely, your
automated infrastructure), a shallow clone [3] takes this down to 90M.
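To see what a shallow clone does without downloading anything, here is a small sketch against a local throwaway repository (the paths and the three dummy commits are invented for the example). Note the `file://` URL: as far as I know, git ignores `--depth` when cloning from a plain local path.

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/full"
cd "$demo/full"
for i in 1 2 3; do
  echo "revision $i" > file.txt
  git add file.txt
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "revision $i"
done

# Clone only the most recent revision.
git clone -q --depth=1 "file://$demo/full" "$demo/shallow"

full_count=$(git -C "$demo/full" rev-list --count HEAD)
shallow_count=$(git -C "$demo/shallow" rev-list --count HEAD)
echo "full history: $full_count commits, shallow clone: $shallow_count commit(s)"

# Compare on-disk size of the two .git dirs.
du -sh "$demo/full/.git" "$demo/shallow/.git"
```

On a toy repository the size difference is negligible; the saving only becomes visible with a long history.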
The critical point to me in all this is that it's easy to set up the
monolithic repository to appear like it's a bunch of separate repos.
But it is impossible, insofar as I can tell, to do the opposite. That
is, option (b) is strictly more powerful than option (a).
Renato has understandably pointed out that the current proposal is
pretty far along, so please speak up now if you want to make this
happen. I think we can.
Regards,
-Justin
[1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
info, see http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/.
As far as I can tell, sparse checkouts work fine on Windows, but you
have to use git-bash, see http://stackoverflow.com/q/23289006.
[2] https://github.com/llvm-project/llvm-project
[3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Justin Lebar via llvm-dev <llvm...@lists.llvm.org> writes:
> I would like to (re-)open a discussion on the following specific question:
>
> Assuming we are moving the llvm project to git, should we
> a) use multiple git repositories, linked together as subrepositories
> of an umbrella repo, or
> b) use a single git repository for most llvm subprojects.
>
> The current proposal assembled by Renato follows option (a), but I
> think option (b) will be significantly simpler and more effective.
> Moreover, I think the issues raised with option (b) are either
> incorrect or can be reasonably addressed.
>
> Specifically, my proposal is that all LLVM subprojects that are
> "version-locked" (and/or use the common CMake build system) live in a
> single git repository. That probably means all of the main llvm
> subprojects other than the test-suite and maybe libc++. From looking
> at the repository today that would be: llvm, clang, clang-tools-extra,
> lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
FWIW, I'm opposed. I'm not convinced that the problems with multiple
repos are any worse than the problems with a single repo, which makes
this more or less just change for the sake of change, IMO.
On Wed, Jul 20, 2016 at 5:02 PM, Justin Bogner via llvm-dev
<llvm...@lists.llvm.org> wrote:
> FWIW, I'm opposed. I'm not convinced that the problems with multiple
> repos are any worse than the problems with a single repo, which makes
> this more or less just change for the sake of change, IMO.
Right now we *are* in a monorepo, with sequential revision numbers
across llvm and clang, so I'd say trying to move to separate repos is
actually the "change" here. :)
-- Sanjoy
Not true. SVN can be checked out by directory, Git needs to be cloned
on the root.
Today I *can* checkout only LLVM and Clang. On a single Git repo I can't.
cheers,
--renato
+1 to everything Justin points out here (and the rest of the email, which I've snipped for brevity).
Before anything else, I've been through a few of these conversions from SVN to git in other projects. In most of the ones I've seen going to submodules of multiple repos, a lot of automation is required just to keep things manageable. That's hard to do on a cross-platform basis (do you script in Python, shell script, one per OS, etc.) and is really more trouble than it's worth -- especially when adding new submodules and/or removing them. They're not impossible to do, but they're also much more work than a single repo.
Just to point out some devil's advocate positions:
- Keeping the current structure will be less churn to existing consumers that have "out of tree" builds based on the current structure. Asking them to change their workflow with SVN significantly (since moving to GitHub is mostly swayed by the SVN interface) will probably be non-trivial amounts of work. We probably need to document this well enough or show that the switch won't affect them too badly.
- Some people value keeping the history of the commits in SVN and the Git counterpart once the move happens (for a lot of valid reasons). Making sure we can merge the histories of all the subproject repositories into a single one should be addressed to preserve "provenance".
- Some people like isolation of workflows and concerns. As a git-native convert, I'm not sold on this, but there are some good reasons to be able to do this (maintainers of certain projects will probably enforce different constraints on when/who/how changes can/should/must be made). Making it possible to do so in a monorepo should be explained well (i.e. does this need any special configs on the repo on the server side, on GitHub, etc.).
All in all I think optimising for the case of the everyday developer working on multiple projects (in my case LLVM, Clang, and compiler-rt, and maybe potentially XRay as a subproject too) is a good cause. Whether this translates to every special consumer of the current set-up is less clear at least to me -- so I'd like to know what other stakeholders here think.
Cheers
Running the same 'git checkout' commands on multiple repos has always
been sufficient to manage the multiple repos so far - as long as you
create the same branches and tags in each repo, it's easy[1] to manage
the set of repos with a script that cd's to each one and runs whatever
git command.
So it's a pretty minor inconvenience today to have the multiple repos in
the case where you want to check out all of them.
OTOH, if all of the repos are combined into one, you have to do work
when you only want some of them. In my experience, this is basically
always - between my various machines and projects I have several
checkouts of llvm+compiler-rt+clang+libc++, and I have a lot of
checkouts of just llvm. I've only checked out the other repos when I was
changing APIs and needed to update them.
I haven't tried the options jlebar has described to deal with these -
sparse checkouts and whatnot, but they seem like an equivalent amount of
work/learning curve as writing a script that cd's to several directories
and runs the same git command in each.
Thus, this also sounds like a minor inconvenience. I just don't see how
trading one for the other is worth doing, since AFAICT they're equally
inconvenient.
[1] My understanding of the "umbrella repo" thing for bisecting is that
it'll be managed automatically by a cron or checkin hooks or
whatever, so the bits in jlebar's description about updating
submodules seem like a red herring. I'm assuming that we end up in a
place where working with git is essentially the same as we work with
git-svn today.
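For reference, the kind of script described above, one that cd's into each repo and runs the same git command, might look roughly like this. `git_each` and `REPOS` are hypothetical names for this sketch, and the two empty repositories merely stand in for separate llvm and clang clones.

```shell
set -eu

# Run the same git command in every repository listed in $REPOS.
git_each() {
  for repo in $REPOS; do
    echo "== $repo"
    git -C "$repo" "$@"
  done
}

# Two throwaway repositories standing in for separate llvm and clang clones.
demo=$(mktemp -d)
REPOS="$demo/llvm $demo/clang"
for repo in $REPOS; do
  git init -q "$repo"
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
done

# Create the same branch in each repo, then confirm where each one is.
git_each checkout -q -b my-feature
git_each rev-parse --abbrev-ref HEAD
```

This keeps branch names aligned across the repos, but nothing records which clang commit goes with which llvm commit.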
This is true if you s/checkout/clone/. With a single repo, you must
clone (download) everything (*), but after you've done so you can use
sparse checkouts to check out (create a working copy of) only llvm and
clang. So you should only notice the fact that there exist things
other than llvm and clang when you first clone (download) llvm.
Either way switching to git is going to be a change from the status
quo. Personally I'm more interested in finding the best overall
solution than the solution which is "most similar" to the current
setup under some metric.
(*) Technically, if you do a shallow clone, you have to download a
single revision of everything. That's the 90M number from my
original post.
So, we use that to a certain extent.
Linaro's GCC validation uses the full checkout, then do a shallow
checkout that only has the updates.
Our LLVM scripts, OTOH, clone all repos and use worktree for *all*
branches, and we only branch on the repos that we choose, for each
"working dir".
Our scripts probably would need certain modifications... but it should be fine.
But I'm not, by far, the most problematic user.
The real problem, and why people accepted sub-modules, is that a lot
of downstream people only use one project or another. Mostly LLVM or
Clang or libc++.
Checking out all of it is bad, but having them officially interlinked,
it seems, is worse. IIUC, the problem is that the projects are now
built independently on their projects, but more and more CMake changes
are creeping in, making it harder and harder to separate their
projects from the rest of LLVM. This means they'll now depend on a
much larger body of sources that will need to be compiled together,
and will probably mean they'll abandon LLVM in favour of something
lighter.
I honestly don't know how big is that problem, I don't have it myself,
but I "can imagine" compiling LLVM and Clang without need would be
pretty bad.
Huh. It definitely hasn't worked well for me.
Here's the issue I face every day. I may be working on (unrelated)
changes to clang and llvm. I update my llvm tree (say I checked in a
patch, or I want to pull in changes someone else has checked in). Now
I want to go back to hacking on my clang stuff. Because my clang
branch is not connected to a specific LLVM revision, it no longer
compiles. I'm trying to build an old clang against a new llvm.
Now I have to pull the latest clang and rebase my patches. After I
deal with rebase conflicts (not what I wanted to do at the moment!),
I'm in a new state, which means when I build my ccache is no help.
And when I run the clang tests, I don't know whether to expect test
failures. So then I have to pop off my patches and run at head...
(Maybe I have to update clang! In which case I also have to update
llvm...)
This would all be solved with zero work on my part if llvm and clang
were in one repository. Then when I switched to working on my clang
patches, I would automatically check out a version of LLVM that is
compatible.
I think this is the main thing that people aren't getting. Maybe
because it's never been possible before to have a workflow like this.
But having a git branch that you can check out and immediately build
-- without any rebasing, re-syncing, or other messing around -- is
incredibly powerful.
Please let me know if this is still not clear -- it's kind of the key point.
As I said, you can accomplish this with submodules, too, but it
requires the complex hackery from my original email.
To me, this is not at all a minor inconvenience. It's at least an
hour of wasted time every week.
> I haven't tried the options jlebar has described to deal with these - sparse checkouts and whatnot, but they seem like an equivalent amount of work/learning curve as writing a script that cd's to several directories and runs the same git command in each.
I'll send sparse checkout instructions separately. But my example
submodules commands are not at all equivalent to a script that cd's
into several directories and runs a git command in each, and I think
this is the main point of confusion. (In fact you wouldn't need to
write such a script; it's just "git submodule foreach".)
The submodules commands create a single branch in the umbrella repo
that encompasses the checked-out state of *all the LLVM subrepos*. So
you can, at a later time, check out this branch in the umbrella repo
and all the clang, llvm, etc. bits will be identical to the last time
you were on the branch.
If all you want is to continue using git the way you use it now, the
multiple git repos get you that (as does a sparse checkout on the
single repo). My point is that the move to git opens up a new, much
more powerful workflow with branches that encompass both llvm and
clang state. We can do this with or without submodules, but using
submodules for this is far more awkward than using a single repo.
-Justin L.
On Wed, Jul 20, 2016 at 5:36 PM, Justin Bogner via llvm-dev
You seem to imply that all the projects in the single repo would be built by default, while it is not part of the proposal.
Actually I’d expect an opt-in mechanism, so that: `mkdir build-llvm && cd build-llvm && cmake ../llvm` only builds LLVM.
—
Mehdi
I should clarify, this is a -0 kind of opposed. If people overwhelmingly
think this is the way to go, I won't try to block it or anything. I'd
rather not have to update a bunch of workflow, infrastructure, and bots
for no particular reason though.
> Also the minor inconvenience in the case of the monolithic repository
> is happening during the initial setup/clone/checkout, and not during
> day-to-day development (git pull, git checkout -b, git commit, git
> push), while the split model induces “minor inconveniences” in the
> day-to-day developer interaction.
> I.e. I prefer using a script to checkout and setup the repo, and then
> be able to use the standard git commands for interacting with it.
>
>
>> [1] My understanding of the "umbrella repo" thing for bisecting is that
>> it'll be managed automatically by a cron or checkin hooks or
>> whatever,
>
> That’s also something that is fragile to me without a deterministic
> way to reconstruct it identically from scratch using only the split
> repositories (which would be possible with "git notes” attached by a
> server-side hook for instance, but unfortunately github does not allow
> it, and the current split-repository proposal exclude even
> *discussing* the merits of other hosting services).
I haven't been following that discussion, but that seems surprising
since AFAICT the only particularly compelling reason to move away from
SVN is that it's easy to find good reliable hosting.
>
>> so the bit's in jlebar's description about updating
>> submodules seem like a red herring. I'm assuming that we end up in a
>> place where working with git is essentially the same as we work with
>> git-svn today.
>
> Some people manage today to have a single commit that update
> clang+llvm at the same time.
> I believe doing this in the split-repository model requires
> write-access to the umbrella repo.
On 21 July 2016 at 01:39, Justin Lebar <jle...@google.com> wrote:
> This is true if you s/checkout/clone/. With a single repo, you must
> clone (download) everything (*), but after you've done so you can use
> sparse checkouts to check out (create a working copy of) only llvm and
> clang. So you should only notice the fact that there exist things
> other than llvm and clang when you first clone (download) llvm.
We were originally trying to avoid too many moves at the same time.
There are already some CMake efforts to help build the different
repositories, but it's not linked to any proposal.
I think doing so would complicate both build system and version
control migrations...
--renato
$ git clone --depth 1 https://github.com/llvm-project/llvm-project.git
$ cd llvm-project
$ ls
clang clang-tools-extra compiler-rt dragonegg klee ...
$ git config core.sparsecheckout true
$ echo "/llvm
/clang" > .git/info/sparse-checkout
$ git read-tree -mu HEAD
$ ls
clang llvm
I suppose you could even wrap this in a script and ship that with
llvm, if you wanted.
I don't know man, when I create a branch to save my clang work I just
create a branch with the same name in all the other repos I have checked
out, then it just stays in the state I left it in as I go do other
stuff. This kind of problem just hasn't really come up for me.
If I do `git log` in a sparse checkout that just has LLVM, will it only
show me LLVM commits? That is, how easy is it to filter out the
clang/lldb/subproject-X commits from a log? Negative globs are kind of
awkward.
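On the `git log` question: as far as I know, a sparse checkout only changes which files appear in the working tree, so a plain `git log` still lists commits from every subproject, while passing a path after `--` limits it to commits touching that directory. A small sketch on a throwaway repository (the commit messages and file names are invented, and `commit` is just a helper for this example):

```shell
set -eu

demo=$(mktemp -d)
git init -q "$demo/mono"
cd "$demo/mono"

# Helper for this sketch: commit with a fixed identity.
commit() {
  git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

mkdir llvm clang
echo a > llvm/a.txt;  git add llvm;  commit "llvm: add a"
echo b > clang/b.txt; git add clang; commit "clang: add b"
echo c >> llvm/a.txt; git add llvm;  commit "llvm: extend a"

# Sparse checkout that keeps only llvm/ in the working tree.
git config core.sparseCheckout true
echo "/llvm/" > .git/info/sparse-checkout
git read-tree -mu HEAD

# History is unaffected by the sparse checkout...
all_commits=$(git rev-list --count HEAD)
# ...but a path argument restricts it to commits touching llvm/.
llvm_commits=$(git rev-list --count HEAD -- llvm)
echo "all: $all_commits, llvm only: $llvm_commits"
git log --format=%s -- llvm
```

With the flat `<root>/llvm`, `<root>/clang` layout, a positive path like this is enough, so no negative globs are needed.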
Ah, I understand your workflow now. That works, I guess. It's
definitely better than what I've been doing. :)
You have to write and use these scripts, of course. I think that's
the main problem -- git is hard enough as it is; asking me to do most
git commands completely differently when I happen to be working on
llvm is asking a lot. Even asking everyone to realize that there's a
better way is asking a lot. Inasmuch as we can make the commands we
type every day Just Work Like Any Other Git Repository, I think that's
a clear win for the community's overall productivity.
Beyond that, I guess the main benefits wrt workflow of the single repo
are that you can much more easily work with cross-cutting changes.
You can stash them, bisect them, reorder them, commit a bunch with one
command, whatever, there's nothing special about the fact that they're
cross-cutting.
And of course we don't get atomic commits across subprojects at all
without a single repo. That really would be nice for certain kinds of
changes.
But I think the bigger point wrt workflows is that there's a real
benefit to having fewer special snowflakes in our lives.
-Justin L.
On 21 July 2016 at 02:06, Daniel Berlin <dbe...@dberlin.org> wrote:
>> Checking out all of it is bad,
>
> Define bad?
> Time?
> Disk space?
> Bandwidth?
>
> I mean, we already assume you have a lot of each anyway?
This is not about me, it's about people that use LLVM projects elsewhere.
>> but having them officially interlinked, it seems, is worse.
>
> Why?
> Below it sounds like you want to do this as a way of enforcing projects to
> stay independent of each other.
Why does everyone take my comments as my own personal motives?
I'm just the "consensus seeker". None of these ideas are mine, I'm
just echoing what was said in 320 emails, plus what was said in the
past few years when people discussed about using pure Git.
People in the IRC were saying I had ulterior motives, that I was
pushing people to use GitHub or sub-modules, or whatever. This is
*really* not cool.
Every single thread so far has died down and I wrote a summary, and no
one said anything. Then I created another thread, and wrote another
summary. Once no one was disagreeing, I wrote the text.
Now every one wants to disagree again. Seriously?
I *personally* don't care if we use GitHub, or GitLab, Git or
mercurial. I don't care if we have sub-modules or a monolithic
repository, but I'm not the only user.
LLVM has, so far, taken the modular approach that other projects can
embed our projects on their products. Downstream commercial products
do that, other OSS projects do that, and that's pretty cool.
GCC has had a *huge* flying monster in the last decade because they
weren't modular enough and that has been the big difference of LLVM,
and why it gained traction on impossible partners, like Emacs.
If we're saying we want to close everything down and make a compiler
like GCC, that will make my life **MUCH** easier. So there is
absolutely *no* point in me pushing the other way.
But I'm not the only user... And I'd rather not be selfish.
If the consensus has changed from last week, or if no one has actually
read the emails and threads and want to do it all over again, please
be my guest.
cheers,
--renato
Before we can agree to merge to a single-repo, there's one further question that must be resolved:
Should the layout in the merged repository be:
1) Like the "llvm-project" git repository is now:
<root>/llvm/
<root>/clang/
<root>/compiler-rt
...
2) Like the "ideal merged checkout" is now:
llvm/
llvm/tools/clang
llvm/projects/compiler-rt
...
I don't much care which of those is chosen. I have a slight preference for #1, for ease of doing things like grep/log/etc on llvm by itself, excluding all the other projects. But either way seems probably fine, and an improvement over multiple repositories.
No. This affects a large part of the LLVM community and llvm-dev is
the most universal place we have to discuss such issues at the moment.
Feel free to help us set up a better list so your delicate
collaborators aren't bombarded with e-mails if you want, but that's a
separate discussion.
Tim.
Sigh.
A bit of history.... as precise as I can muster.
1. Git's been on the back of our heads for a long time
2. Some event (can't remember) triggered a discussion on IRC where
some core devs were mostly in agreement
3. I decided to take it on, folks were happy, I sent a huge email with
*ALL* options. Local hosting, external hosting (all), Git, Mercurial,
SVN, monolithic, or not, etc.
4. There were hundreds of emails, in many cycles, and in each step, I
took a step back, wrote everything that was being said (not what I
wanted), and waited for disagreement.
During this process, I also proposed "voting". Tanya, very helpfully,
said it would be better to have a survey, so we don't take hard
decisions based on simple counts. Chandler, also very helpfully said
we needed a concrete implementation example in which to base our
decisions. People seemed generally in favour to gauge the opinions
more generally, so having "A" proposal was better than having general
discussions.
For me, personally, any monolithic Git repository would do. But there
was a lot of feedback on it not being monolithic, and then on it having
sub-modules, so I was echoing the larger voice, not my own.
And the idea, at least as "I" interpreted it, was to have "some"
concrete example, and a wide survey.
No one said that:
* the result of the survey would dictate the move
* I would get to choose it alone (s/I/anyone/)
* there weren't better models
I specifically stated that this was one of the models, which I was
trying to push through the survey in the interest of getting a feeling
for how people like it *really*.
I specifically said there could be other proposals, other surveys.
> Even a minimum of "if you look at what X said about Y in the thread", or
> something, would go a long way here.
>
> Otherwise you are basically saying "hey, i think i heard, in the past 300
> emails, X". That's not really something that one can respond to reasonably.
Oversimplifying it is a bit offensive.
Taking one of my points in isolation, as if it were my whole argument,
each time, *is* oversimplifying it.
> FWIW: I actually think the LLVM community ratholes on a lot of things, *way
> too much*. Not sure we are quite at that point yet on this.
Having a precise proposal and survey is one way many people proposed
to get out of the rat hole. People are generally more conscious in
surveys than replying to email threads, and any personal attack they
send is restricted to the idea (or becomes childish), which is really
what we want. Mailing lists are too prone to trolling to be an
effective consensus reaching place. We've seen our share.
So, a cyclic model with a proposal and a survey seems like a good thing
to do. GitHub+modules is not *my* proposal, but *our* first proposal.
That's why I added it to a "Proposals" directory in the docs, and why
I wasn't worrying too much if people liked it on the review. It is one
reflection of one discussion from one angle.
>> GCC has had a *huge* flying monster in the last decade because they
>> weren't modular enough and that has been the big difference of LLVM,
>> and why it gained traction on impossible partners, like Emacs.
>
> Errr, i'm not sure this is really the reason, but let's ignore that :)
Again, taking one point as if I meant *everything*, and oversimplifying.
There was certainly a tone to GCC's predicament that was not being
modular enough (being used as a library, extending its AST to Emacs,
having external projects use it in some form), but I have made no
assertion as to what *it* is, or how important *it* is in the whole
scheme of things. You should make no assumption as to my intentions
other than a simple statement.
> I think you may need to move a *little* slower, FWIW. On one hand you are
> saying "there are 300+ emails", but you expect consensus in a week?
> That seems .. a bit much :)
> What if someone was on vacation last week?
The threads lasted for 1 1/2 months, after "soft" discussions for years.
I didn't expect consensus "on the whole problem" in a week, just
consensus on the first proposal, GitHub+modules, which seemed to have
already been reached weeks before.
> So like I said, if you are going to seek consensus and drive it by voicing
> the concerns of others, that's great. I applaud it. But when doing it, you
> may want to make clear that is what you are doing, and who said what, so
> that the right people can be cc'd with the right responses, etc
You have no idea how many times I read the same emails over and over
to make sure I cite the right person. That's why I have consistently
re-written a summary of every thread, with proper quotes and
everything.
But as you said, we tend to not get out of rat holes, and there is only
so much rereading of the same emails I can cope with.
A lot of what happened is that a number of people (and I'm being
purposely vague) are opposed to it, and are raising the same concerns
over and over, even though there were arguments to refute what they
are saying.
How many more times do I need to go back, read the emails again and
quote what people said, so that people can feel comfortable? Is that
really the best use of *our* time? Keeping ourselves in rat holes?
I personally think not. That's why I wanted to get at least one proposal
out and see what people thought of it.
This may be entirely the wrong approach, and I accept your arguments,
and that's why I wrote "be my guest". It wasn't out of spite, but I'm
really saying, "please do it".
However, I'd really like it if people would stop the personal attacks.
Reiterating, this is not *my* proposal.
So, from now on...
* I've done my part and got "consensus" for one proposal. It is what it is.
* Justin is forming consensus on the monolithic version. This is a
*different* proposal, so it needs to take into account hosting service
and everything else we did in the first.
* Please add a similar document to "docs/Proposals" at the end.
* Repeat.
I'll refrain from driving any other proposal in the interest of mental
sanity (and personal time), not because I support the GitHub+modules
proposal.
When everyone is happy that we have enough proposals, Tanya's survey
should be brought forward, in which case I'll gladly offer my help
again.
I hope this is clear enough and people will stop second guessing me.
regards,
Same here.
> I'm also really sad to hear that people have been impugning your motives,
> because you've done a tremendous amount of work to bring this to a
> conclusion, and it really ought to be clear to everyone that you've been
> doing an admirable job of driving towards consensus here, and basically
> nothing more.
Thank you. Appreciated.
> IMO, the only reason we can even have this conversation about a single-repo
> reasonably now is because of your work in writing up clearly the scheme for
> a multi-repo solution. So I hope you don't feel discouraged by this turn of
> events! I personally put the entire credit of getting to this point on your
> hard work.
Haven't thought of it that way. I feel better already. :)
> I don't much care which of those is chosen. I have a slight preference for
> #1, for ease of doing things like grep/log/etc on llvm by itself, excluding
> all the other projects. But either way seems probably fine, and an
> improvement over multiple repositories.
I don't have a strong preference, but #1 proponents weakly convinced
me with two arguments:
1. it is easier to mix-and-match repositories as you like
I'd still symlink as I do today, but I can see why this would be
interesting for off-tree users.
2. it "makes more sense" to let Clang *use* LLVM instead of LLVM *host* Clang
this seems more preference than anything, but people that know CMake
more than I do said it would be "easier" and I trust them. I have no
technical arguments pro or against.
Though, I'd be fine with anything really.
There was a separate thread where people seemed in favour of a
different list. It's probably only a matter of time before the
foundation creates it.
--renato
Right - I was assuming a layout where the subprojects are already in the
places they need to be checked out to. With llvm-project's layout my
question is silly.
llvm-project's layout is kind of annoying, since with that I have to
check out all of the repos yet I still need to add symlinks or something
to actually use any of them. It also means that anyone who is only using
llvm has to change their paths from /path/to/llvm to /path/to/llvm/llvm,
which is a little bit ugly.
James and I owe you something here. I think this can be handled in a
straightforward manner, but I am not 100% sure how at the moment. I
agree this is very important.
Our demo would be much more compelling if we can use an existing
branch. Does anyone know of one we can play with?
> In particular, the fact that we have a third more public GitHub forks of LLVM than of clang, and eight times as many as of lldb implies to me that forcing everyone downstream to pull in all subprojects would not be particularly well received.
I have a hard time understanding this particular argument. Per the
original e-mail, with three shell commands, you can hide whichever
llvm subprojects you want. After doing that, the only overhead of the
subprojects is extra space in your .git directory, which would still
be much smaller than an llvm+clang objdir.
Is there something specific that you think will not be well-received?
Or maybe it's better to speak personally -- is there something
specific that will bother you personally about having to clone (but
not check out) everything?
With the single repository approach, maintaining a long-running branch
that touches multiple subprojects (e.g. llvm and clang) becomes *far*
simpler.
With the umbrella repo, you have to do the submodules trickery I
described in the original e-mail. It is complicated, and takes a lot
of typing (or requires you to develop custom scripts). But with the
single repo, this cross-cutting branch is just a branch.
In fact even if your branch isn't cross-cutting, if it's not a branch
of LLVM proper, I'm curious how you'd do things like bisect the
branch, or even just check out and build an old version. You check
out an old version of the (say) clang branch, and then presumably you
try to figure out the corresponding version in the LLVM repo that you
need to check out. I guess you could find the upstream parent of your
branch, get the SVN revision number from the commit message, then go
to the LLVM branch and find a commit which has an SVN number that's
nearby?
This would all become as simple as "git checkout" under the monolithic model.
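The revision-matching workaround described above can be sketched as follows. This is only an illustration: it assumes the imported commits carry the usual git-svn trailer (`git-svn-id: <url>@<rev> <uuid>`), and the commit list, hashes, URL and UUID are all made up for the example.

```python
import bisect
import re

# git-svn appends a trailer of the form "git-svn-id: <url>@<rev> <uuid>".
SVN_ID = re.compile(r"^git-svn-id: \S+@(\d+) ", re.MULTILINE)


def svn_revision(message):
    """Return the SVN revision recorded in a commit message, or None."""
    found = SVN_ID.search(message)
    return int(found.group(1)) if found else None


def nearest_llvm_commit(llvm_commits, clang_rev):
    """Pick the newest llvm commit whose SVN revision is <= clang_rev.

    llvm_commits is a list of (svn_rev, git_hash) pairs sorted by svn_rev.
    """
    revs = [rev for rev, _ in llvm_commits]
    idx = bisect.bisect_right(revs, clang_rev) - 1
    return llvm_commits[idx][1] if idx >= 0 else None


# Hypothetical data, for illustration only.
llvm_commits = [(276100, "aaa111"), (276105, "bbb222"), (276120, "ccc333")]
clang_message = (
    "Fix a crash in Sema\n\n"
    "git-svn-id: https://example.org/svn/clang/trunk@276110 "
    "00000000-0000-0000-0000-000000000000\n"
)
clang_rev = svn_revision(clang_message)
llvm_hash = nearest_llvm_commit(llvm_commits, clang_rev)
```

Even in this simplified form, the lookup has to be repeated for every step of a bisect, which is the overhead the single-repository model avoids.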
On Jul 20, 2016, at 11:03 PM, Renato Golin via llvm-dev <llvm...@lists.llvm.org> wrote:
> When everyone is happy that we have enough proposals, Tanya's survey
> should be brought forward, in which case I'll gladly offer my help
> again.
Which projects do we put under this monolithic repository?
SVN has about 42 projects, some of them dead, some of them on life support.
So far, being "an upstream repository" meant being inside the LLVM SVN
server. We'll change that to "being inside the monolithic LLVM
repository". But this can become huge, and not all projects "link" back
to LLVM.
An alternative would be to just have some core projects in the
monolithic and everything else as separate, but then what's core?
As a back-of-the-envelope, I suggest: llvm, clang, clang-tools-extra,
compiler-rt, libc++, libc++abi, libunwind, test-suite.
I'm thinking LLD and LLDB could remain out, but I don't think it would
be too weird for them to be in...
Anything else? Less?
cheers,
--renato
As part of any potential migration, everyone involved must start to
accept certain changes (large or small) to the workflow. The big
challenge here isn't technical, it's mindset. It's convincing any
group of people who object that it won't be as painful to them as they
think. (I hope this is a true statement)
#if - there's a group of people who are dogmatic, stubborn and
unreasonable - others outside that group should decide how to deal
with them: ignore, coddle, placate or other.
I don't think there's a perfect technical solution to make everyone
happy - I think focusing on the social engineering will be of equal or
greater importance. (herding cats)
With the survey - I guess you could include some level of objection.
"Strongly against" and "over my dead body" type reactions are
probably the ones to be most cautious about. Anyone surveyed who falls in
the middle or slightly left/right can be seen as "flexible". If it
turns out that the survey shows only 1-5 people with extreme views
and 100 people with moderate or flexible views - those are hard
numbers. From there decisions can be made and long unending threads
like this can die - so we can all get back to reading more important
things.
but it interlocks with libunwind and compiler-rt...
> The same applies to libunwind. If you’re building an entire toolchain then you might want to use it, but most projects don’t benefit from it and it implements a well-defined standard ABI and so doesn’t need to be updated in lockstep with anything else.
Using RT without libunwind on ARM is weird. libgcc_s has some of the
functionality, but the split between libgcc, _s and _eh is not the
same as compiler-rt, libc++abi and libunwind.
If one wants a reasonable solution, one (today) needs to include all
three. Then why not libc++? I mean, GCC does build libstdc++ in tree
already, so it wouldn't be unheard of.
> clang-tools-extra is explicitly a bunch of stuff that doesn’t belong in the main clang repo because it’s not of interest to most people doing clang work, so it’s hard to see why it would be of interest to everyone doing LLVM work. Additionally, I believe that they’re mostly things that are built on top of APIs in clang that are supposed to be moderately stable, so shouldn’t need atomically updating with respect to clang very often.
ok, no strong feelings about it.
> Compiler-rt probably makes sense if clang is there, as it includes a lot of the run-time support for clang.
RT strongly fits into the core. If there is a minimal-minimal core to be
set, that'd be { llvm, clang, RT }. If not for what it can do today,
for what it should do in the future.
The proposal at the moment is to include
llvm, clang, clang-tools-extra, lld, polly, lldb, llgo, compiler-rt,
openmp, and parallel-libs.
This is the set {llvm} plus the transitive closure of "projects that
are version-locked to a project in the set", where the closure is
taken over the set of all active LLVM subprojects.
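As a rough illustration of the "transitive closure" wording above, here is a small sketch. The dependency table is hypothetical (the thread does not enumerate which project is locked to which); only the shape of the computation is the point.

```python
def version_locked_closure(seed, locked_to, active):
    """Return seed plus every active project transitively version-locked to it.

    locked_to maps a project to the projects whose exact version it needs.
    """
    closure = set(seed)
    changed = True
    while changed:
        changed = False
        for project in active:
            if project in closure:
                continue
            # A project joins the set once anything it is locked to is in it.
            if any(dep in closure for dep in locked_to.get(project, ())):
                closure.add(project)
                changed = True
    return closure


# Hypothetical dependency data, for illustration only.
locked_to = {
    "clang": ["llvm"],
    "clang-tools-extra": ["clang"],
    "lldb": ["llvm", "clang"],
    "test-suite": [],
    "libc++": [],
}
active = list(locked_to) + ["llvm"]
result = version_locked_closure({"llvm"}, locked_to, active)
```

With this made-up table, test-suite and libc++ stay out because they are not locked to anything in the set, which matches the reasoning in the proposal.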
Projects that don't depend on a specific version of llvm or some other
subproject -- test-suite and libc++ -- are not included. Everything
else is, because the whole idea is to have one repository that
captures the implicit versioning dependencies between (say) lldb and
llvm.
As soon as we have one version-locked subproject that's not in the
monolithic repo, we now have to maintain an umbrella repo that
tells you which version of llvm corresponds to which version of the
version-locked-but-not-in-monolithic repo.
The cost of including additional projects in the monolithic repository
is very low, since you can ignore them using sparse checkouts.
Not true - the non-gnu libunwind which is outside of the llvm family
of projects works just fine on AArch64. We're dealing with hopefully
standard interfaces and A1 vs A2 comparisons.
Again - this thread is digressing - wrong solution to wrong problem
- Renato, step back and try not to let the thread get sidetracked
Before we can agree to merge to a single-repo, there's one further question that must be resolved:
Should the layout in the merged repository be:
1) Like the "llvm-project" git repository is now:
<root>/llvm/
<root>/clang/
<root>/compiler-rt
...
2) Like the "ideal merged checkout" is now:
llvm/
llvm/tools/clang
llvm/projects/compiler-rt
...
On 7/21/16 11:03 AM, C Bergström via llvm-dev wrote:
> Monolithic is trying to solve the wrong problem - it's that simple.
> Any discussion or attempt to coddle those who think it's necessary is
> a waste of time. #dictator
Christopher,
AFAICT, you haven't explained *why* it is the wrong problem. Mind
elaborating on that?
Jon
p.s: edicts, appeals to authority, and ad hominems are not useful for
discussion. Doing that, and following up with "#dictator" further
solidifies that you know your own argument is b-s.... please stop.
--
Jon Roelofs
jona...@codesourcery.com
CodeSourcery / Mentor Embedded
What do people think of having one (or a set of) merge commit(s)
merging in the non-llvm projects that will be part of the new
monorepo? That's the only technique I can think of that will preserve
history for downstream users by construction.
-- Sanjoy
--
Sanjoy Das
http://playingwithpointers.com
This would solve the problem of importing history. But if we did it
this way, you'd be unable to check out versions of the complete repo
from before the merge date. So you'd be unable to bisect back before
the merge date, for example. I think the umbrella repo might be a
better solution than one which had that property.
I don't know if there's a way to allow checkouts of everything from
before the merge date while also making the custom branch merge to the
monolithic repository as trivial as "git merge". I think it may
depend on git's handling of file renames, and if so...I am not too
hopeful. :)
For at least David's branches, I think it would be really cool if we
could merge the llvm and clang branches into a single branch with
correct history. We wouldn't be able to do that if we used git merge
to build the monorepo out of its constituent pieces.
I really, *really* would like to see libc++ / abi / unwind. :)
My reason is that, when building toolchains, the C++ ABI and unwinding
are fundamental parts of the run-time library, of which RT is only
a part.
RT has the builtins (and a lot of other stuff), but it can't unwind on
its own. So debuggers (LLDB), profilers (which lives in RT) and basic
stack traces don't work, unless you use an alternative option (like
libgcc). This is *especially* true for ARM.
When unwinding C++ code, one needs cxa_* functions, and that's in
libc++abi, which interoperates with libc++, unwind and RT.
The LLVM triple abi/unwind/RT is not divided in the same way as
gcc_eh/gcc_s/gcc, so picking some but not others is not a sane option.
Plus, validating every possible choice needs one buildbot for each
combination, which is not feasible, at least not for us.
Basically, picking RT and not unwind/abi breaks their
inter-dependencies, so does picking abi but not libc++.
> Projects that don't depend on a specific version of llvm or some other
> subproject -- test-suite and libc++ -- are not included. Everything
> else is, because the whole idea is to have one repository that
> captures the implicit versioning dependencies between (say) lldb and
> llvm.
I'm fine with the test-suite not being in the core, but the others
will make it very hard to build actual toolchains.
They're also reasonably small, rarely updated and self-contained, so I
don't see why they can't be there.
On 21 July 2016 at 18:12, Justin Lebar <jle...@google.com> wrote:
> llvm, clang, clang-tools-extra, lld, polly, lldb, llgo, compiler-rt,
> openmp, and parallel-libs.
I really, *really* would like to see libc++ / abi / unwind. :)
Makes sense.
We work in RT and have a strong vested interest in libunwind, and for
us, having them bundled would be a major win.
libc++/abi are more like dependencies, but would also be much nicer
bundled. Marshall may have a better view on that specific subject.
On Jul 21, 2016, at 2:11 PM, Chandler Carruth via llvm-dev <llvm...@lists.llvm.org> wrote:
> On Thu, Jul 21, 2016 at 1:55 PM Renato Golin via llvm-dev <llvm...@lists.llvm.org> wrote:
>> On 21 July 2016 at 18:12, Justin Lebar <jle...@google.com> wrote:
>>> llvm, clang, clang-tools-extra, lld, polly, lldb, llgo, compiler-rt,
>>> openmp, and parallel-libs.
>> I really, *really* would like to see libc++ / abi / unwind. :)
> FWIW, I agree for all the reasons you outline.
Nobody downstream has to adopt the new structure; I believe it is possible to extract only the “llvm” commits from the new repo and rebase them on top of the existing llvm repo.
This can be done on the fly by your CI, but it is also a deterministic process, i.e. you can restart from scratch anytime (assuming you have the original llvm.git repo and the new one).
>
> What do people think of having one (or a set of) merge commit(s)
> merging in the non-llvm projects that will be part of the new
> monorepo? That's the only technique I can think of that will preserve
> history for downstream users by construction.
I have no idea what you mean here?
—
Mehdi
I think I understand what you mean:
1) checkout the existing clang repo
2) move everything in a subdirectory “clang”
3) commit the move
4) merge this into the new “llvm-project”.
5) repeat for every single project
That should preserve the hashes and spare users from having to “extract” the subproject to merge into their own branch.
Annoyingly, it breaks git log path/to/file though.
—
Mehdi
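To illustrate why the move-then-merge steps above keep existing hashes intact, here is a toy model. It is not git's real object format: `commit_id` is a made-up stand-in that hashes parents, tree contents and message, which is enough to show the effect. The file names and history are hypothetical.

```python
import hashlib


def commit_id(parents, tree, message):
    """Toy stand-in for a commit hash: digest of parents, tree and message."""
    h = hashlib.sha1()
    for parent in parents:
        h.update(parent.encode())
    for path in sorted(tree):
        h.update(f"{path}\0{tree[path]}\0".encode())
    h.update(message.encode())
    return h.hexdigest()


def build_history(snapshots):
    """Chain (tree, message) snapshots into a linear list of commit ids."""
    ids, parents = [], []
    for tree, message in snapshots:
        cid = commit_id(parents, tree, message)
        ids.append(cid)
        parents = [cid]
    return ids


def prefixed(tree, prefix):
    """Return the same tree with every path moved under prefix/."""
    return {f"{prefix}/{path}": blob for path, blob in tree.items()}


# Hypothetical two-commit clang history.
clang = [({"lib/Sema.cpp": "v1"}, "first"), ({"lib/Sema.cpp": "v2"}, "second")]
original = build_history(clang)

# Steps 2 and 3 above: one extra "move" commit on top of the old history.
moved = build_history(
    clang + [(prefixed(clang[-1][0], "clang"), "move into clang/")]
)

# Alternative: rewrite every commit as if files had always lived in clang/.
rewritten = build_history([(prefixed(t, "clang"), m) for t, m in clang])
```

In this model the first two ids of `moved` equal `original`, while every id in `rewritten` differs. The cost, as noted above, is that path-based history lookups stop at the move commit.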
I know... :)
> If we should change our minds later we can opt-in to anything else we
> want (libcxx etc, lld? lldb? who knows) but in the meantime they are
> unnecessary baggage for my purposes.
I really see no way of doing this without bikeshedding, other than do
what Mehdi suggested and put in all non-dying projects.
Cloning the first repo could be bad, especially for some of our
boards, so I won't propose it myself. Setting up NFS is rarely an
option (support, stability), so we will suffer for including lld,
lldb, etc. I'd prefer to see them out.
Since I'm not the one trying to reach consensus in this thread, I'll
just state what would be best for me and let Justin collect the
opinions. :)
I'm honestly fine with whatever decision, as we can usually work
around the problems, and it's probably cheaper than to bikeshed to
death.
cheers,
--renato
Use `git log --follow path/to/file`. It's better ;)
I know, it works most of the time for log, but how do you blame at a revision older than the move?
—
Mehdi
As a developer, you can checkout part of the repo with sparse-checkout.
As a downstream integrator, you can filter out the repo history as you want before merging into your repo.
—
Mehdi
On Jul 21, 2016, at 2:32 PM, Mehdi Amini via llvm-dev <llvm...@lists.llvm.org> wrote:
> On Jul 21, 2016, at 2:29 PM, Mehdi Amini <mehdi...@apple.com> wrote:
>> On Jul 21, 2016, at 11:03 AM, Sanjoy Das via llvm-dev <llvm...@lists.llvm.org> wrote:
>>> FWIW, like David Chisnall, we (Azul) have a problem with rewriting
>>> history.
>>> Our LLVM fork has O(100) changes diverging from upstream
>>> (though our branching structure is simple), and keeping all of that
>>> history is important.
>> Nobody downstream has to adopt the new structure; I believe it is possible to extract only the “llvm” commits from the new repo and rebase them on top of the existing llvm repo.
>> This can be done on the fly by your CI, but it is also a deterministic process, i.e. you can restart from scratch anytime (assuming you have the original llvm.git repo and the new one).
is superior to the submodule thing
which can be maintained centrally by people who actually understand how to
do it.
> As a downstream integrator, you can filter out the repo history as you
> want before merging into your repo.
Hmmm maybe, maybe not. It sounds like the claim is: you can do a sparse
checkout of upstream, then merge it to a different branch, and get only
the history of the stuff that was sparsely checked out.
I really like your idea of having a few "projected" git repositories
(i.e. capture all commits that touch llvm/ into llvm.git, all that
touch clang/ to clang.git etc.). I think it should solve our problem
of llvm-forks-with-downstream changes very nicely (I think we won't
have to do anything, as you said). I still want to sleep on it to see
if I can spot any issues.
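A minimal sketch of what such a "projected" view could look like, assuming each monorepo commit is reduced to a message plus the list of paths it touches. The function name and the sample history are hypothetical; a real projection would also have to carry authorship, dates and file contents.

```python
def project(commits, subdir):
    """Keep only commits touching subdir/, with paths made relative to it.

    Each commit is a (message, [paths]) pair, listed in monorepo order.
    """
    prefix = subdir + "/"
    projected = []
    for message, paths in commits:
        touched = [p[len(prefix):] for p in paths if p.startswith(prefix)]
        if touched:
            projected.append((message, touched))
    return projected


# Hypothetical monorepo history, for illustration only.
mono = [
    ("Add new pass", ["llvm/lib/Transforms/Foo.cpp"]),
    (
        "Change API and update caller",
        ["llvm/include/llvm/IR/X.h", "clang/lib/CodeGen/Y.cpp"],
    ),
    ("Fix driver flag", ["clang/lib/Driver/Z.cpp"]),
]
llvm_view = project(mono, "llvm")
clang_view = project(mono, "clang")
```

Note that the cross-cutting commit shows up in both views with the same message but only its own share of the files.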
@David Chisnall and others with local forks: can you spot any
potential issues with Mehdi's plan? Are there cases where it won't
work?
-- Sanjoy
I wanted to try to merge David's llvm and clang branches into a single
branch -- that would be a big usability improvement over the current
situation. But there isn't enough information in the repositories to
recover the correct interleaving. You could try to order by date, but
that only works so long as the history is linear... So I gave up on
that feature.
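For reference, the "order by date" heuristic mentioned above amounts to something like the sketch below. It assumes each branch is already a linear, date-sorted list of commits; the timestamps and descriptions are invented. With merges or out-of-order dates in either history, the result is no longer a trustworthy interleaving.

```python
import heapq


def interleave_by_date(llvm_commits, clang_commits):
    """Merge two date-sorted commit lists into one list ordered by timestamp."""
    return list(heapq.merge(llvm_commits, clang_commits, key=lambda c: c[0]))


# Hypothetical (timestamp, description) pairs, for illustration only.
llvm_branch = [(1, "llvm: A"), (4, "llvm: B")]
clang_branch = [(2, "clang: A"), (3, "clang: B")]
combined = interleave_by_date(llvm_branch, clang_branch)
```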
I also kind of like the idea of these projected repositories, and if
that's sufficient, awesome, save us some work.
Developer time, barrier to entry for new contributors. Getting the sparse-checkout business right looks like it is actually non-trivial and not recommended for the git novice. *Changing* the sparse-checkout configuration later appears to be fraught with peril (easy to get wrong).
The claim is to keep the existing history (i.e. no hash changes) that is currently at http://llvm.org/git/llvm.git and continue to accumulate there any new commit that touches the llvm subdirectory of the unified repo.
This would be a read-only view of course, but just like it is now.
Hmmm so there's still a per-old-project view? Missed that aspect, sorry… it would let us preserve our processes in terms of integrating the flow from upstream, although being able to get a correctly linearized flow of commits from the unified repo would be preferable and we would *want* to change over. Still not clear how to make that work with a sparse checkout.
--paulr
It's eminently copy-pastable, and there is no possibility of data loss.
I understand it's not zero cost, but I have trouble seeing how there's
a meaningful comparison between
- the cost of three copy-pastable commands run once, versus
- the benefit of simplifying the git commands we all run tens or
hundreds of times a day.
> *Changing* the sparse-checkout configuration later appears to be fraught with peril (easy to get wrong).
If you get it wrong, you don't have the right files in your checkout,
and you get a build error about a missing file...
Here too, I get that there's a nonzero possibility that one could
screw this up and get themselves into trouble, but when I actually do
the cost/benefit analysis, it is very hard for me to see how the costs
are anywhere near the same magnitude as the benefits.
This is absolutely on the table as far as I'm concerned. In a world
with separate repos it might make sense to use the presence or absence
of particular source files to trigger building (or not) a particular
project, but that makes little sense with a monolithic repository.
(I mean, it doesn't personally affect me because I never type plain
"ninja" -- I always do "ninja check-clang" or whatever. But that's
just *my* messed up workflow. :)
-Justin
On Wed, Jul 20, 2016 at 7:08 PM James Y Knight via llvm-dev <llvm...@lists.llvm.org> wrote:
> Should the layout in the merged repository be:
> 1) Like the "llvm-project" git repository is now:
> <root>/llvm/
> <root>/clang/
> <root>/compiler-rt
> ...
> 2) Like the "ideal merged checkout" is now:
> llvm/
> llvm/tools/clang
> llvm/projects/compiler-rt
> ...
> I don't much care which of those is chosen. I have a slight preference for
> #1, for ease of doing things like grep/log/etc on llvm by itself, excluding
> all the other projects. But either way seems probably fine, and an
> improvement over multiple repositories.
FWIW, I strongly prefer #2, but I think the high order bit is the repository question.
I’ll start by saying I’ve skimmed this thread and am not actually a user of LLVM at all, but had some git thoughts that might be worth contributing.
> On 22 Jul 2016, at 01:16, Sanjoy Das via llvm-dev <llvm...@lists.llvm.org> wrote:
>
> @David Chisnall and others with local forks: can you spot any
> potential issues with Mehdi's plan? Are there cases where it won't
> work?
One potential “issue” is that a single commit into the monolithic repository would potentially touch multiple subprojects (that’s one of the advantages). Projecting that into individual repositories would only commit changes to those files, but the commit message would be maintained and might therefore be confusing in the context of the individual repository, especially if only a small part of the commit affects that individual sub-repo.
Essentially if the projects are “supposed” to be separate modules, then submodules is the solution to enforce that independence, ensuring commits in each module only affect that module and have appropriate commit messages for that context.
If the submodules are in practice more intertwined than that, then it does feel like an ideologically pure solution that in the end just gets in the way of developer productivity.
I’ve got a setup here that uses a hierarchy of submodules, so there is a “combined” submodule that just ensures that its children (other submodules) are at mutually compatible versions. That helped productivity (multiple consumers of the “combined” submodule don’t need to manually track versions of all the children) but this discussion is pushing me towards the thought that actually a monorepo would be a more productive solution anyway, and make more sense for cross-cutting changes.
And sorry to throw another option into the ring; and one that might already have been discussed and discounted, but thought it worth sharing.
1) Create a new llvm-project-mono repo
2) Use git subtree instead of git submodule to add all the directories to match the layout of llvm-project.
3) From now on, all commits go to the monorepo
4) monorepo commits can be projected to the individual project repos, and additionally a new commit on llvm-project can be made with the submodule version updates
Advantages:
- No change for existing downstream users unless they want to move to the mono view
- Easier developer experience for cross-cutting changes
- Git log by path would work identically on either view of the repository
- Hashes from before the creation of the mono repo would match in both views - the mono repo will have multiple roots but that’s not unusual with git subtree
Disadvantages:
- Step 4 from my list would need a script to keep things updated. A server-side hook would be best. The mapping is deterministic (every mono repo commit will map to one commit in any affected submodules and one “submodule update” commit in the umbrella llvm-project repo), so if the server responsible falls over the updates might be delayed but can be caught up without losing anything
- Less ideologically pure in terms of trying to keep the modules independent
- Commit hashes will diverge between the two views from the creation of the mono repo, making comparisons / merges between clones of the different views more difficult
Simon
*Please*, stop calling it *my* proposal. It absolutely wasn't.
I'll repeat, as people seem to prefer repeated arguments to
references to past emails, but:
* I have sent a number of concerns and options
* People have favoured GitHub with sub-modules (I hadn't)
* I summarised the first proposal, which seemed to be reaching consensus
Let's call it "First Proposal" or "GitHubSubMod" proposal:
http://llvm.org/docs/Proposals/GitHubSubMod.html
Everything "out-of-discussion" on the first proposal was in the
interest of reaching a self-contained proposal, and had absolutely no
ulterior motive.
Now the proposal is there, best we could make it. If there are
technical flaws, by all means, send a review to that document, but you
can't change that proposal into something else.
You can, however, create a new one, and that's what you're doing.
As people said earlier, getting to know one proposal well, has shown
many people that the "consensus" might not have been the best way
forward, but that was only possible by actually finalising at least
one proposal.
My assumption was that a survey would take us to the next step
(finding the precise and impersonal problems with that proposal), but
it seems I didn't need that. I stand corrected.
One thing your proposal doesn't even touch is where the repo will be.
I know it's basically orthogonal, but it's one of the key reasons why
we need to move. I have no preference, as long as the solution is
maintainable and caters for our needs.
My personal opinion is to host somewhere professional unless there's a
good reason not to.
If we use external hosting, GitHub is the best because there are
already thousands of forks there (see Chisnall's email), and
people do come to the list thinking the GitHub repo is our official
one.
If we don't, we'll have to understand the costs and who's going to
maintain it (volunteer vs. hired help). Relying on volunteers (like
myself) is extremely risky and I'd very much rather not go that way.
Relying on any company can create bias (or the impression of bias),
which can divide the community.
Again, I'm not pushing *any* agenda, just laying out the issues. But
if you want to compete with the first proposal, you *have* to have a
complete proposal, with all the pros and cons clearly laid out.
cheers,
--renato
PS: We may need a grid of proposals ({external, local} x {submod,
monolithic})...
> Anyway, not trying to derail the discussion, just express that there are likely many others like me out there who are silent not because we don’t have an opinion, but because we just want git and don’t want to have an excessive number of +1’s on a thread saying so.
+1 :-)
There's another reason I've been staying quiet too which is that past experience with VCS migrations has taught me that people tend to over-value some things and that discussion tends not to convince people in advance of direct experience. I think some of these topics will end up being moot once we've moved to git and gotten used to it. For example, I've seen talk of wanting to preserve linear history which is understandable since it's quite nice to have. However, I suspect we'll drop that after a month or so as people find 'git push' doesn't work very well on a high traffic repo and start looking for alternatives. At that point I think we'll end up switching to pull requests and accepting non-linear history. Similarly, I think the desire for incremental revision numbers will gradually fade as people get used to git.
From: llvm-dev [mailto:llvm-dev...@lists.llvm.org]
On Behalf Of Pete Cooper via llvm-dev
Sent: 21 July 2016 17:46
To: Renato Golin
Cc: LLVM Developers
Subject: Re: [llvm-dev] [RFC] One or many git repositories?
Thanks for driving this, Renato. It's going to be a huge benefit to everyone once we have a solution in place.
On Jul 20, 2016, at 11:03 PM, Renato Golin via llvm-dev <llvm...@lists.llvm.org> wrote:
When everyone is happy that we have enough proposals, Tanya's survey
should be brought forward, in which case I'll gladly offer my help
again.
Regarding the survey specifically, and since I didn’t see a thread discussing survey options, I’d love to have a ‘I don’t mind what the solution is, I just want git’ option. Basically, ‘any of the above’.
For me, I’m very happy with the proposals being discussed, but mostly just want to move to a more reliable hosting service (full disclosure, I’m a fan of GitHub), and I use git-svn anyway so native git would be best for me.
Anyway, not trying to derail the discussion, just express that there are likely many others like me out there who are silent not because we don’t have an opinion, but because we just want git and don’t want to have an excessive number of +1’s on a thread saying so.
Cheers,
Pete
This is valid on a monolithic model, and that is one of the reasons I prefer it.
Today, I personally prefer the Git model (merges, pull requests, fuzzy
history), but I haven't always done so. The more I learnt how to use
Git, the more I realised how valuable the "confusing model" is for
distributed development.
Trying to force Git into an SVN model for the long term feels like
creating a niche that will be hard to work with (no hard evidence,
pinch of salt and all that).
I don't maintain a downstream fork, so I can't speak for that niche.
But forks in GitHub (all, not just LLVM's) seem to be fine merging
their patches over the original repository.
What this feels like to me is that we were too complacent with the old
model and were slowly creeping Git support into an SVN world, and now we
have realised how unusual our "requirements" are.
Maybe you're right. Maybe moving to yet another model that satisfies
those requirements would be a step back, because we'd be setting in
stone a rule that was accommodated, not designed.
Maybe we should propose a third model: Use Git like Git. Pull requests and all.
As a quick recap, here's a back-of-the-envelope idea of what could go
wrong...
Changes that are the same as in linear monolithic core with external projects:
* the repositories themselves will have to adapt
* the build system (CMake and all)
* how the non-core repositories interact (relates to build system, bisect)
* all public forks (GitHub and others)
* all downstream forks (much of the current active LLVM development is affected)
New problems will be created:
* public and downstream forks that *rely* on linear history
* validation (buildbots will have to be re-written, or we'd have to
move to Jenkins, pull-request testing, etc)
* bisection (all our current tools will have to understand Git)
* library dependencies will be hard to bisect, because they won't be
in the same repository with the same history. This happens today in
GNU-land with binutils, glibc, etc.
All in all, not *that* different from the linear monolithic proposal,
and in my opinion, a future-facing design, not a past-driven
conformance.
cheers,
--renato
The original idea was to change one thing at a time. SVN to Git, keep
everything else the same.
But that has proven harder than we imagined. So, maybe the best way
forward is not to do one step at a time, but to understand where we
are and what we need and take the "right" (tm) step forwards. Even if
it requires multiple steps, we can combine them into larger, fewer
steps.
>> * public and downstream forks that *rely* on linear history
>
> Do you have an example in mind? I'd expect them to rely on each 'master' being
> an improvement on 'master^'. I wouldn't expect them to be interested in how
> 'master^' became 'master'.
Paul Robinson was outlining some of the issues he had with git
history. I don't know their setup, so I'll let him describe the issues
(or he may have done so already in some thread, but I haven't read it
all).
> Assuming the goal is to preserve what we have rather than improve it, buildbot
> will be fine without any changes (beyond switching the source steps from svn to
> git of course) whichever model we pick. It would just check out the latest 'master'
> on each build like it currently does for trunk.
I meant Zorg and the like. Buildbot itself can handle Git, but we may
have assumptions that the repos are linked and linear in the builders.
But we have been discussing pre-commit testing for a while and it's
clear that Buildbots, in the way they're set up now, are not the
answer.
For the sake of the argument, here is the list of things we found:
* buildbots can have pre-commit testing via patch submission, but
controlling security and load is not trivial if we want people to
actually use it
* buildbots tracking non-master branches have the load problems if we
allow people to create branches, but not the security problems
* having a mirror so that bots track that mirror would solve the
security and load problems, but remove the ability for other people to
use it.
In essence, buildbots are single purpose and hard to configure (much
of it needs a master restart).
OTOH, Jenkins can have configurable build scripts, with parameters and
customisations, that allow us to pick pull requests and build
them, as they come.
It also scales independently, per architecture, from the number of
configurations, if you can use something like containers. So, in the
long term, it's cheaper and more robust to maintain.
However, it's a big change and will require another massive change in
how we do things, and the repository is already big enough.
Historically, we use buildbots like we do as a way to work around the
fact that SVN doesn't have pull requests.
> The ease in git of branching -- and more importantly rebasing the branch on
> a later state of master -- means that you can run buildbots for all the
> different platforms on each pull request BEFORE merging it to master.
Indeed, this would be a *great* improvement.
> If buildbots are not fast enough to test every change (let alone repeatedly)
> then you can keep a pristine "master" head and a "proposed master" head that
> might have several pull requests added onto it sequentially. Then have the
> buildbots test the "proposed master" and if it passes then fast-forward
> advance the "master" head to the current "proposed master" head. Then merge
> the next batch of pull requests onto "proposed master", rinse and repeat.
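The "proposed master" scheme quoted above could be sketched roughly as follows. This is only an illustration in a throwaway repository: the branch names (pr-1, pr-2, proposed-master) and the run_tests function are hypothetical stand-ins, and the pull requests are merged rather than rebased.

```shell
# Throwaway repository with a pristine master.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name dev
git config user.email dev@example.com
echo base > file.txt
git add file.txt
git commit -q -m "Base"

# Two pull requests, modelled as branches off master.
git checkout -q -b pr-1 master
echo one > one.txt
git add one.txt
git commit -q -m "PR 1"
git checkout -q -b pr-2 master
echo two > two.txt
git add two.txt
git commit -q -m "PR 2"

# Batch the pull requests onto proposed-master, one after the other.
git checkout -q -b proposed-master master
git merge -q --no-edit pr-1
git merge -q --no-edit pr-2

# Hypothetical stand-in for the buildbots testing proposed-master.
run_tests() { test -f one.txt && test -f two.txt; }

# Only if the batch passes, fast-forward master to proposed-master.
if run_tests; then
    git checkout -q master
    git merge -q --ff-only proposed-master
fi
master_head=$(git rev-parse master)
proposed_head=$(git rev-parse proposed-master)
echo "master is at $master_head"
```

Because master only ever moves by fast-forward, it never points at a state the bots have not tested.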
We don't need to turn off the current post-commit bots, though. We
don't even need to use buildbots for pre-commits.
The current bots are good at covering the basics, like a last line of
defence. For pull requests we could have a simplified *additional*
testing that would pick the majority of the breakages.
That could be Jenkins or something else, that can drive configurable
builds through a large shared pool of resources, which is much more
suitable to pre-commit testing.
These would have to be *only* fast builders (~30min or less) and
should cover different targets. We should aim to have at least one per
supported target.
Dear all,
I would like to (re-)open a discussion on the following specific question:
Assuming we are moving the llvm project to git, should we
a) use multiple git repositories, linked together as subrepositories
of an umbrella repo, or
b) use a single git repository for most llvm subprojects.
The current proposal assembled by Renato follows option (a), but I
think option (b) will be significantly simpler and more effective.
Moreover, I think the issues raised with option (b) are either
incorrect or can be reasonably addressed.
Specifically, my proposal is that all LLVM subprojects that are
"version-locked" (and/or use the common CMake build system) live in a
single git repository. That probably means all of the main llvm
subprojects other than the test-suite and maybe libc++. From looking
at the repository today that would be: llvm, clang, clang-tools-extra,
lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
Let's first talk about the advantages of a single repository. Then
we'll address the disadvantages raised.
At a high level, one repository is simpler than multiple repos that
must be kept in sync using an external mechanism. The submodules
solution requires nontrivial automation to maintain the history of
commits in the umbrella repo (which we need if we want to bisect, or
even just build an old revision of clang), but no such mechanisms are
required if we have a single repo.
Similarly, it's possible to make atomic API changes across subprojects
in a single repo; we simply can't do that with the submodules proposal.
And working with llvm release branches becomes much simpler.
In addition, the single repository approach ties branches that contain
changes to subprojects (e.g. clang) to a specific version of llvm
proper. This means that when you switch between two branches that
contain changes to clang, you'll automatically check out the right
llvm bits.
Although we can do this with submodules too, a single repository makes
it much easier.
As a concrete example, suppose you are working on some changes in
clang. You want to commit the changes, then switch to a new branch
based on tip of head and make some new changes. Finally you want to
switch back to your original branch. And when you switch between
branches, you want to get an llvm that's in sync with the clang in
your working copy.
Here's how I'd do it with a monolithic git repository, option (b):
git commit # old-branch
git fetch
git checkout -b new-branch origin/master
# hack hack hack
git commit # new-branch
git checkout old-branch
Here's how I'd do it with option (a), submodules. I've used git -C
here to make it explicit which repo we're working in, but in real life
I'd probably use cd.
# First, commit to two branches, one in your clang repo and one in your
# master repo.
git -C tools/clang commit # old-branch, clang submodule
git commit # old-branch, master repo
# Now fetch the submodule and check out head. Start a new branch in the
# umbrella repo.
git submodule foreach git fetch
git checkout -b new-branch origin/master
git submodule update
# Start a new branch in the clang repo pointing to the current head.
git -C tools/clang checkout -b new-branch
# hack hack hack
# Commit both branches.
git -C tools/clang commit # new-branch
git commit # new-branch
# Check out the old branch.
git checkout old-branch
git submodule update
This is twice as many git commands, and almost three times as much
typing, to do the same thing.
Indeed, this is so complicated I expect that many developers wouldn't
bother, and will continue to develop the way we currently do. They
would thus continue to be unable to create clang branches that include
an llvm revision. :(
There are real simplifications and productivity advantages to be had
by using a single repository. They will affect essentially every
developer who makes changes to subprojects other than LLVM proper,
cares about release branches, bisects our code, or builds old
revisions.
So that's the first part, what we have to gain by using a monolithic
repository. Let's address the downsides.
If you'll bear with a hypothetical: Imagine you could somehow make the
monolithic repository behave exactly like the N separate repositories
work today. If so, that would be the best of both worlds: Those of us
who want a monolithic repository could have one, and those of us who
don't would be unaffected. Whatever downsides you were worried about
would evaporate in a mist of rainbows and puppies.
It turns out this hypothetical is very close to reality. The key is
git sparse checkouts [1], which let you check out only some files or
directories from a repository. Using this facility, if you don't like
the switch to a monolithic repository, you can set up your git so
you're (almost) entirely unaffected by it.
If you want to check out only llvm and clang, no problem. Just set up
your .git/info/sparse-checkout file appropriately. Done.
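For anyone who hasn't used sparse checkouts, a minimal sketch of that setup looks something like this. It builds a tiny stand-in repository rather than cloning the real one, and the top-level directory names (llvm/, clang/, lldb/) are assumptions about how a monolithic repository might be laid out.

```shell
# Tiny stand-in for the monolithic repository (hypothetical layout).
src=$(mktemp -d)
cd "$src"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name dev
git config user.email dev@example.com
mkdir llvm clang lldb
echo llvm > llvm/README
echo clang > clang/README
echo lldb > lldb/README
git add .
git commit -q -m "Import"

# Clone without populating the working tree, then enable sparse checkout.
work=$(mktemp -d)
git clone -q --no-checkout "$src" "$work/llvm-project"
cd "$work/llvm-project"
git config core.sparseCheckout true
mkdir -p .git/info
printf 'llvm/\nclang/\n' > .git/info/sparse-checkout

# Populate the working tree; only the listed directories appear.
git checkout -q master
checked_out=$(ls | sort | paste -sd' ' -)
echo "checked out: $checked_out"
```

The full history is still fetched into .git; only the working tree is restricted.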
If you want to be able to have two different revisions of llvm and
clang checked out at once (maybe you want to update your clang bits
more often than you update your llvm bits), you can do that too. Make
one sparse checkout just of llvm, and make another sparse checkout
just of clang. Symlink the clang checkout to llvm/tools/clang.
That's it. The two checkouts can even share a common .git dir, so you
don't have to fetch and store everything twice.
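One way to get two checkouts that share a single object store is git worktree (available since git 2.5); this is an assumption about tooling, and other mechanisms exist. The sketch below uses a throwaway repository and a hypothetical llvm-pinned directory name, and it leaves out the sparse-checkout and symlink steps.

```shell
# Throwaway repository with two revisions (hypothetical layout).
base=$(mktemp -d)
mkdir -p "$base/llvm-project"
cd "$base/llvm-project"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name dev
git config user.email dev@example.com
mkdir llvm clang
echo v1 > llvm/version.txt
echo v1 > clang/version.txt
git add .
git commit -q -m "Revision 1"
echo v2 > llvm/version.txt
echo v2 > clang/version.txt
git commit -q -am "Revision 2"

# Second working tree pinned at the older revision. It shares the
# object store of the first checkout, so nothing is fetched twice.
git worktree add --detach ../llvm-pinned HEAD~1 >/dev/null 2>&1

newer=$(cat clang/version.txt)
older=$(cat ../llvm-pinned/llvm/version.txt)
echo "clang bits at $newer, llvm bits at $older"
```

Here the main checkout tracks the tip while llvm-pinned stays at the older revision until you move it explicitly.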
As far as I can tell, the only overhead of the monolithic repository
is the extra storage in .git. But this is quite small in the scheme
of things.
The .git dir for the existing monolithic repository [2] is 1.2 GB. By
way of comparison, my objdir for a release build of llvm and clang is
3.5 GB, and a full checkout (workdir + .git dirs) of llvm and clang is
0.65 GB.
If the 1.2 GB really is a problem for you (or more likely, your
automated infrastructure), a shallow clone [3] takes this down to 90 MB.
The critical point to me in all this is that it's easy to set up the
monolithic repository to appear like it's a bunch of separate repos.
But it is impossible, insofar as I can tell, to do the opposite. That
is, option (b) is strictly more powerful than option (a).
Renato has understandably pointed out that the current proposal is
pretty far along, so please speak up now if you want to make this
happen. I think we can.
Regards,
-Justin
[1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
info, see http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/.
As far as I can tell, sparse checkouts work fine on Windows, but you
have to use git-bash, see http://stackoverflow.com/q/23289006.
[2] https://github.com/llvm-project/llvm-project
[3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git
From: "Richard Smith via llvm-dev" <llvm...@lists.llvm.org>
To: "Justin Lebar" <jle...@google.com>
Cc: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Friday, July 22, 2016 3:08:18 PM
Subject: Re: [llvm-dev] [RFC] One or many git repositories?
Having read through the entire thread and thought about this for a while, here are my thoughts:
* A single monolithic repository has quite a lot of advantages, some because of what it is (for instance, you can make atomic cross-project commits), and some because of what it isn't (keeping the repositories separate creates synchronization problems for version-locked components, and it's not clear to me that we have a good answer for these problems)
* A single repository from which we can build a complete LLVM toolchain, without requiring checking out a dozen components in seemingly-random locations, would be valuable. The default behavior for someone checking out and building the LLVM project should be that they get a complete, fully-functional toolchain.
* We need to preserve and maintain the easy ability to mix and match LLVM components with other components (other C runtime libraries, C++ ABI libraries, C++ standard libraries, linkers, debuggers, ...). That means that it needs to be obvious what the boundaries of the optional components are, which means that the current project layout (the one implied by the build system) is not good enough for a monolithic repository (LLVM tests will fail if you don't check out llvm/tools/opt, but we presumably want to explicitly support not checking out llvm/tools/clang) -- unless we have extensive documentation covering this, and even then there are likely to be discoverability issues.
However, the move to git and the reorganization need not be done at the same time, and it seems vastly easier to reorganize *after* we move to a monolithic git repository -- it would then be essentially trivial for each person with organizational ideas to move the code around in their monolithic git repository, push it somewhere where we can all look at it, and for us to then make an informed choice about the layout, with a concrete example in front of us. Then we push the selected new layout; git supports this really nicely if all the parts are already in a single repository.
So here's what I would suggest:
- we move to a monolithic git repository on github
- this monolithic repository contains all the LLVM subprojects necessary to build a complete toolchain, including libc++ and other pieces that are not version-locked to llvm or clang
- the initial structure exactly matches the current layout implied by the build system (clang in tools/clang, lld in tools/lld, compiler-rt in runtimes/compiler-rt, libc++ in projects/libcxx, and so on)
- after we transition to git, interested parties assemble and upload to github patches reorganizing the project structure, and we have another discussion about principles for the restructuring (including forming solid guidance for how to organize future additions to LLVM), with reference to the patches so we can look at the proposed new layout; we pick one and commit it
From: "Piotr Padlewski via llvm-dev" <llvm...@lists.llvm.org>
To: "Richard Smith" <ric...@metafoo.co.uk>
Cc: "llvm-dev" <llvm...@lists.llvm.org>
Sent: Friday, July 22, 2016 3:18:31 PM
Subject: Re: [llvm-dev] [RFC] One or many git repositories?
I have one reason why we should not move to a monolithic repository: if you do some light stuff like clang-tidy, which doesn't often require syncing with clang, but you still want to have the most recent checks, then I don't see a solution in a monolithic repository. And this is a real issue if you only have a 2- or 4-core laptop to do work on. And I guess the build system won't solve the problem: just a small change in some llvm file will result in recompiling many files that clang-tidy depends on.
Having read through the entire thread and thought about this for a while, here are my thoughts:
* A single monolithic repository has quite a lot of advantages, some because of what it is (for instance, you can make atomic cross-project commits), and some because of what it isn't (keeping the repositories separate creates synchronization problems for version-locked components, and it's not clear to me that we have a good answer for these problems)
* A single repository from which we can build a complete LLVM toolchain, without requiring checking out a dozen components in seemingly-random locations, would be valuable. The default behavior for someone checking out and building the LLVM project should be that they get a complete, fully-functional toolchain.
* We need to preserve and maintain the easy ability to mix and match LLVM components with other components (other C runtime libraries, C++ ABI libraries, C++ standard libraries, linkers, debuggers, ...). That means that it needs to be obvious what the boundaries of the optional components are, which means that the current project layout (the one implied by the build system) is not good enough for a monolithic repository (LLVM tests will fail if you don't check out llvm/tools/opt, but we presumably want to explicitly support not checking out llvm/tools/clang) -- unless we have extensive documentation covering this, and even then there are likely to be discoverability issues.
However, the move to git and the reorganization need not be done at the same time, and it seems vastly easier to reorganize *after* we move to a monolithic git repository -- it would then be essentially trivial for each person with organizational ideas to move the code around in their monolithic git repository, push it somewhere where we can all look at it, and for us to then make an informed choice about the layout, with a concrete example in front of us. Then we push the selected new layout; git supports this really nicely if all the parts are already in a single repository.
So here's what I would suggest:
- we move to a monolithic git repository on github
- this monolithic repository contains all the LLVM subprojects necessary to build a complete toolchain, including libc++ and other pieces that are not version-locked to llvm or clang
- the initial structure exactly matches the current layout implied by the build system (clang in tools/clang, lld in tools/lld, compiler-rt in runtimes/compiler-rt, libc++ in projects/libcxx, and so on)
- after we transition to git, interested parties assemble and upload to github patches reorganizing the project structure, and we have another discussion about principles for the restructuring (including forming solid guidance for how to organize future additions to LLVM), with reference to the patches so we can look at the proposed new layout; we pick one and commit it
The goal would be to have the new layout entirely settled by the time 4.0 branches.
I think that we should still keep the test-suite in a separate repository (both because it is very large, and should be even larger, and because it follows a very different licensing policy).
> If you do some light stuff like clang-tidy, that don't often require syncing with clang, but you still want to have the most recent checks, then I don't see a solution in monolithic repository.
Please see my original e-mail, in the paragraph that begins "If you
want to be able to have two different revisions of llvm and clang
checked out at once".
This describes a workflow that would allow you to update clang-tidy
without updating all of llvm. I think this would address the issue
you raise.
I grant that setting this up would require a one-time but nonzero
amount of work from developers like you. But then the question is
whether we should optimize for this one-time advantage for a few
developers or advantages for the vast majority of us that affect our
work every day.
-Justin
The build system can help: you just need to have two (sparse) checkouts, one for LLVM/clang and the other for clang-tidy, and configure the build with the LLVM/clang checkout, adding clang-tidy as external.
Mehdi
> Can you describe it more?
Something like this
$ git clone g...@github.com:llvm/llvm.git
$ git clone g...@github.com:llvm/llvm.git clang-tools-extras
$ # make the clang-tools-extras checkout sparse if you want; it doesn't seem strictly necessary, though
$ mkdir build && cd build
$ cmake ../llvm \
    -DLLVM_EXTERNAL_CLANG_TOOLS_EXTRA_SOURCE_DIR=../clang-tools-extras/tools/clang/tools/clang-tools-extras
$ make
> I don't get the approach, but it seems we are trying to make it easier to use llvm, but in the same time we are making it harder.
As Justin said, this isn't an issue for the majority of developers and
it's a solvable problem for you.
> BTW Does anyone knows why cmake is reloading each time I update llvm/clang repo? I hope that both approaches would solve this problem, because it doesn't seem like a something that should happen.
It reconfigures every time a CMakeLists.txt file used in the build is
changed, which is unavoidable as far as I'm aware.
Cheers.
Tim.