We've made a great deal of progress toward producing the modularized
history that people seem to want, but we need to sort out the last
steps.
KDE handled this step with the tool described at
http://techbase.kde.org/Projects/MoveToGit/UsingSvn2Git. To use that
tool, a modularizer creates a "ruleset" that describes how to extract
one particular module. We've done some experiments with that and it
seems to work, but it has segfaulted on Boost's early history (somewhere
in the first 100 commits). That might be reasonably easy to fix. Its
other big weakness is that it doesn't guarantee that every commit
and file is accounted for. If no ruleset matches a particular commit,
that commit will never end up in Git.
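To make that coverage gap concrete, here is a minimal sketch of the kind of check that would catch it. It is not the KDE tool's own logic; the rule patterns and paths are hypothetical, and rules are modeled simply as regular expressions anchored at the start of an SVN path.

```python
import re


def unmatched_paths(rules, paths):
    """Return the paths that no rule pattern matches, in input order."""
    compiled = [re.compile(rule) for rule in rules]
    return [p for p in paths if not any(c.match(p) for c in compiled)]


# Hypothetical rule patterns and SVN paths, for illustration only.
rules = [r"/trunk/boost/regex/", r"/trunk/libs/regex/"]
paths = ["/trunk/boost/regex/v4/regex.hpp", "/trunk/tools/build/Jamroot"]

# Anything printed here would be silently dropped by a conversion
# that only exports what its rules match.
print(unmatched_paths(rules, paths))
```

Running a check like this over every path touched in the SVN history would show what a given set of rulesets leaves behind.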
Here follows the current status of /my/ recent work.
## We have a Jenkins installation at http://jenkins.boost.org
[When you visit Jenkins you may be asked to sign in at GitHub and you
will have to authorize our Jenkins to access your public GitHub info --
reasons at
https://issues.jenkins-ci.org/browse/JENKINS-16347]
It is doing three things continuously ATM:
* Mirroring Boost's SVN locally
  - Code: https://github.com/ryppl/svn-log-mirror
  - Jenkins: http://jenkins.boost.org/job/svn-log-mirror
* Building John Wiegley's subconvert tool (see below for details)
  - Code: https://github.com/ryppl/subconvert
  - Jenkins: http://jenkins.boost.org/job/subconvert/
* Running subconvert on our SVN mirror
  - Code: https://github.com/ryppl/subconvert/blob/master/bin/update.sh
  - Jenkins: http://jenkins.boost.org/job/modularize/
## Things subconvert does:
* Checks through the SVN history to make sure:
  1. Every path is assigned a destination in
     https://github.com/ryppl/subconvert/blob/master/doc/branches.txt,
     which tells
     - Is it a tag or branch?
     - Path in SVN
     - Corresponding Git symbolic reference
     (IIUC, that file's 2nd, 3rd, and 4th columns have no effect)
  2. Every SVN committer has an entry in
     https://github.com/ryppl/subconvert/blob/master/doc/authors.txt
* Uses that information to produce a Git replica of SVN that gets pushed
to
https://github.com/ryppl/boost-history
* Uses
https://github.com/ryppl/subconvert/blob/master/doc/modules.txt
(just a copy of Daniel Pfeifer's work at
https://github.com/ryppl/boost-modularize/blob/master/develop.txt)
to sort each commit into separate modules
* Pushes those modules up to
https://github.com/boostorg/
(I have gotten everything except this step to work; by the time you
read this it will likely be done).
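The module-sorting step can be pictured roughly as follows. This is only a sketch and not subconvert's implementation: it assumes the mapping has already been loaded into a plain dictionary from path prefix to module name (the real format of modules.txt is not reproduced here), and the prefixes and module names are hypothetical.

```python
def module_for(path, prefix_to_module):
    """Return the module owning `path` by longest matching prefix, or None."""
    best = None
    for prefix in prefix_to_module:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return prefix_to_module[best] if best is not None else None


def split_commit(changed_paths, prefix_to_module):
    """Group one commit's changed paths by module; unmapped paths go under None."""
    groups = {}
    for path in changed_paths:
        groups.setdefault(module_for(path, prefix_to_module), []).append(path)
    return groups


# Hypothetical mapping and commit contents, for illustration only.
mapping = {
    "boost/regex/": "regex",
    "libs/regex/": "regex",
    "boost/config/": "config",
}
commit = ["boost/regex/a.hpp", "libs/regex/src/b.cpp", "tools/x"]
print(split_commit(commit, mapping))
```

Anything that lands under the `None` key is a path with no module assigned, which is exactly what a coverage check would want to report.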
## Potential issues with subconvert
* It never creates a merge in Git. The commits are still there; they
just don't have two parents. For example,
https://github.com/ryppl/boost-history/commit/f81a9c4e855. The KDE
tool doesn't have this problem.
* It's pretty fragile. See the end of
http://jenkins.boost.org/job/modularize/33/console for one mysterious
example among many.
* It's quite slow, and that's *with* a 14 GB RAM disk
* It doesn't do anything about common svn properties that may have
meaning in the Git world, e.g. svn:eol-style
* It works completely based on source path, without considering source
revision. So if you have an SVN path that needs to be treated in one
way at one point in history and differently at another point, you'd
need to extend subconvert. (The KDE tool doesn't have this problem).
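To illustrate the last point, here is a minimal sketch of what revision-aware rules could look like. It does not reflect either tool's actual code or file format; the `Rule` type, its field names, and the example paths and revision numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    prefix: str                    # SVN path prefix the rule applies to
    target: str                    # where matching content should go
    min_rev: int = 0               # first revision the rule applies to
    max_rev: Optional[int] = None  # None means "no upper bound"


def resolve(rules, path, rev):
    """Return the target of the first rule matching both path and revision."""
    for rule in rules:
        in_range = rule.min_rev <= rev and (rule.max_rev is None or rev <= rule.max_rev)
        if in_range and path.startswith(rule.prefix):
            return rule.target
    return None


# The same SVN path is routed differently before and after revision 1000.
rules = [
    Rule("/trunk/boost/foo/", "old_module", max_rev=999),
    Rule("/trunk/boost/foo/", "new_module", min_rev=1000),
]
print(resolve(rules, "/trunk/boost/foo/bar.hpp", 500))
print(resolve(rules, "/trunk/boost/foo/bar.hpp", 1500))
```

A purely path-based mapping is the special case where every rule leaves both revision bounds at their defaults.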
## Handling Merges
Unlike Git, which merges history, an SVN merge simply records the
information about which individual revisions have been incorporated. In
that sense, an SVN merge is like a "squashed cherrypick".
However, when *all* the previously-unmerged revisions on a given branch
are squashed into another branch, that *does* correspond to a real
merge. Many things that don't look like real merges in the monolithic
conversion could easily be real merges in a modularized repo.
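The distinction can be stated as a small predicate. The sketch below is only an illustration of the reasoning above, with made-up revision numbers and module names: a merge counts as "real" when it brings over every source-branch revision that had not been merged before.

```python
def is_full_merge(source_revs, previously_merged, merged_now):
    """True when merged_now covers every source revision not merged before."""
    outstanding = set(source_revs) - set(previously_merged)
    return bool(outstanding) and outstanding <= set(merged_now)


# Hypothetical branch history: which modules each branch revision touched.
revs_touching = {11: {"regex"}, 12: {"thread"}, 13: {"regex"}}
all_revs = sorted(revs_touching)
regex_revs = [r for r in all_revs if "regex" in revs_touching[r]]

merged_now = [11, 13]

# Monolithic view: revision 12 was left out, so this is a cherrypick.
print(is_full_merge(all_revs, [], merged_now))
# Modularized view of regex alone: nothing is left out, so it is a real merge.
print(is_full_merge(regex_revs, [], merged_now))
```

The second call shows why a merge that looks partial in the monolithic conversion can be complete once the history is restricted to a single module.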
In practice, merges are recorded in Boost's SVN in three ways at various
points in time, sometimes in combination, probably including "the
empty combination":
1. In log comments
2. In the svnmerge property created by
http://www.orcaware.com/svn/wiki/Svnmerge.py
3. In the svn:mergeinfo property created by svn 1.5+
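For the third form, the merge-tracking property written by svn 1.5+ (svn:mergeinfo) stores one line per merge source, in the shape `/path:rev,first-last,...`. The following is a minimal parsing sketch, assuming well-formed input; the sample value is made up, and the function name is hypothetical.

```python
def parse_mergeinfo(value):
    """Parse an svn:mergeinfo value into {source_path: [(first, last), ...]}."""
    result = {}
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so the path itself may contain colons.
        path, _, revlist = line.rpartition(":")
        ranges = []
        for item in revlist.split(","):
            item = item.strip().rstrip("*")  # '*' marks a non-inheritable range
            first, _, last = item.partition("-")
            ranges.append((int(first), int(last or first)))
        result[path] = ranges
    return result


sample = "/branches/release:100-120,125\n/trunk:50-60*"
print(parse_mergeinfo(sample))
```

The log-comment and svnmerge.py forms would each need their own extraction step; this sketch covers only the built-in property.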
Possible approaches:
1. Don't worry about it; don't attempt to recreate merges
2. Attempt to extract the information and:
   a. Find the highest SVN revision involved in any apparent merge and
      make its Git commit a parent
   b. If necessary, attempt to merge a rewritten history of the source
      branch that contains only the cherrypicked commits.
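Approach 2a could be sketched along these lines. This is an assumption-laden illustration rather than a design: it supposes the merged revision ranges have already been extracted, and that a hypothetical `rev_to_sha` mapping from source-branch SVN revisions to converted Git commits is available.

```python
def merge_parent(merged_ranges, rev_to_sha):
    """Return the Git commit for the highest merged SVN revision that was converted.

    merged_ranges: list of inclusive (first, last) SVN revision ranges.
    rev_to_sha:    {svn_revision: git_commit_id} for the source branch.
    """
    candidates = [
        rev
        for rev in rev_to_sha
        if any(first <= rev <= last for first, last in merged_ranges)
    ]
    return rev_to_sha[max(candidates)] if candidates else None


# Made-up revisions and commit ids, for illustration only.
ranges = [(100, 120), (125, 125)]
branch_commits = {110: "aaa", 118: "bbb", 130: "ccc"}
print(merge_parent(ranges, branch_commits))
```

Only revisions that actually produced a commit on the source branch are considered, since a recorded range may span revisions that never touched it.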
## Recommendations
* We should fix the bug in the KDE tool and use that. Reasons:
- The tool and the approach have been proven both technically and
socially by another project, larger than Boost
- It handles merges
- It isn't a crazy resource hog
- It allows individuals to develop conversions one submodule at a time
* We should plunder subconvert and boost-modularize.
They both contain useful information in their mapping files. We
should use those to generate initial versions, at least, of the input
files used by the KDE tool. We should use subconvert (or something
similar based on
https://github.com/ryppl/svndumpparse) to ensure that
no commits or committers are being dropped on the floor.
* With regard to merge handling, we should make a tool that, on a
module-by-module basis, uses approach 2b above to give modularizers a
guide to commits worthy of inspection in the resulting Git repo.
Your thoughts welcomed.
Incidentally, I'm running out of time to work on this project before I
start at Apple in February. No pressure, though ;-)
--
Dave Abrahams
BoostPro Computing Software Development Training
http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost