Producing Modularized History

Dave Abrahams

Jan 15, 2013, 12:30:23 PM

to Daniel Pfeifer, John Wiegley, Ryppl Developers

We've made a great deal of progress toward producing the modularized
history that people seem to want, but we need to sort out the last
steps.

KDE did this process by using
http://techbase.kde.org/Projects/MoveToGit/UsingSvn2Git. To use that
tool, a modularizer creates a "ruleset" that describes how to extract
one particular module. We've done some experiments with that and it
seems to work, but it has segfaulted on Boost's early history (somewhere
in the first 100 commits). Might be reasonably easy to fix. Its
biggest other weakness is that it doesn't guarantee that every commit
and file is accounted for. If no ruleset matches a particular commit,
it will never end up in Git.
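
The coverage gap described above can be checked independently of the conversion. The following is a minimal sketch, not part of svn2git: it assumes each ruleset has been reduced to a list of regular expressions over SVN paths and that the changed paths of each revision are already available (for example from a parsed dump). The function name, patterns, and revision data are hypothetical.

```python
import re

def unmatched_paths(rule_patterns, changed_paths_by_rev):
    """Return {revision: [paths]} for every changed path that no rule matches."""
    compiled = [re.compile(p) for p in rule_patterns]
    missing = {}
    for rev, paths in sorted(changed_paths_by_rev.items()):
        # re.match anchors at the start, so each pattern acts as a path prefix.
        orphans = [p for p in paths if not any(r.match(p) for r in compiled)]
        if orphans:
            missing[rev] = orphans
    return missing

# Hypothetical rules and history, for illustration only.
rules = [r"/trunk/libs/filesystem/", r"/trunk/boost/filesystem/"]
history = {
    10: ["/trunk/libs/filesystem/src/path.cpp"],
    11: ["/trunk/tools/build/jam.c", "/trunk/boost/filesystem/path.hpp"],
}
report = unmatched_paths(rules, history)
print(report)
```

Any revision that shows up in the report contains at least one path that would silently be left out of Git.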

Here follows the current status of /my/ recent work.

## We have a Jenkins installation at http://jenkins.boost.org

[When you visit Jenkins you may be asked to sign in at GitHub and you
will have to authorize our Jenkins to access your public GitHub info --
reasons at https://issues.jenkins-ci.org/browse/JENKINS-16347]

It is doing three things continuously ATM:

* Mirroring Boost's SVN locally
  - Code: https://github.com/ryppl/svn-log-mirror
  - Jenkins: http://jenkins.boost.org/job/svn-log-mirror
* Building John Wiegley's subconvert tool (see below for details)
  - Code: https://github.com/ryppl/subconvert
  - Jenkins: http://jenkins.boost.org/job/subconvert/
* Running subconvert on our SVN mirror
  - Code: https://github.com/ryppl/subconvert/blob/master/bin/update.sh
  - Jenkins: http://jenkins.boost.org/job/modularize/

## Things subconvert does:

* Checks through the SVN history to make sure:
  1. Every path is assigned a destination in
     https://github.com/ryppl/subconvert/blob/master/doc/branches.txt,
     which tells us:
     - Is it a tag or branch?
     - Path in SVN
     - Corresponding Git symbolic reference
     (IIUC, that file's 2nd, 3rd, and 4th columns have no effect)
  2. Every SVN committer has an entry in
     https://github.com/ryppl/subconvert/blob/master/doc/authors.txt
* Uses that information to produce a Git replica of SVN that gets pushed
  to https://github.com/ryppl/boost-history
* Uses https://github.com/ryppl/subconvert/blob/master/doc/modules.txt
  (just a copy of Daniel Pfeifer's work at
  https://github.com/ryppl/boost-modularize/blob/master/develop.txt)
  to sort each commit into separate modules
* Pushes those modules up to https://github.com/boostorg/
  (I have gotten everything except this step to work; by the time you
  read this it will likely be done).
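
The committer check in the list above amounts to a set difference. This is a minimal sketch of the idea, not subconvert's actual code; it assumes the author map has already been loaded into a dictionary keyed by SVN user name, and all names and data are hypothetical.

```python
def missing_authors(svn_committers, author_map):
    """Return the sorted SVN user names that have no entry in the author map."""
    return sorted(set(svn_committers) - set(author_map))

# Hypothetical data; the real names live in authors.txt and the SVN log.
author_map = {"alice": "Alice Example <alice@example.com>"}
committers = ["alice", "bob", "alice"]
absent = missing_authors(committers, author_map)
print(absent)
```

An empty result means every committer seen in the history can be given a proper Git identity.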

## Potential issues with subconvert

* It never creates a merge in Git. The commits are still there; they
just don't have two parents. For example,
https://github.com/ryppl/boost-history/commit/f81a9c4e855. The KDE
tool doesn't have this problem.
* It's pretty fragile. See the end of
http://jenkins.boost.org/job/modularize/33/console for one mysterious
example among many
* It's quite slow, and that's *with* a 14G ram disk
* It doesn't do anything about common svn properties that may have
meaning in the Git world, e.g. svn:eol-style
* It works completely based on source path, without considering source
revision. So if you have an SVN path that needs to be treated in one
way at one point in history and differently at another point, you'd
need to extend subconvert. (The KDE tool doesn't have this problem).

## Handling Merges

Unlike Git, which merges history, an SVN merge simply records the
information about which individual revisions have been incorporated. In
that sense, an SVN merge is like a "squashed cherrypick".

However, when *all* the previously-unmerged revisions on a given branch
are squashed into another branch, that *does* correspond to a real
merge. Many things that don't look like real merges in the monolithic
conversion could easily be real merges in a modularized repo.
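
The distinction between a squashed cherry-pick and a real merge can be expressed as a set comparison. This is a minimal sketch under stated assumptions: the revisions of the source branch and the revisions recorded as merged are already known, and all function and parameter names are hypothetical.

```python
def is_full_merge(source_branch_revs, already_merged, newly_merged):
    """True when, after this merge, no source-branch revision up to the newest
    merged one is left unmerged, i.e. the merge behaves like a real Git merge."""
    if not newly_merged:
        return False
    tip = max(newly_merged)
    merged = set(already_merged) | set(newly_merged)
    eligible = {r for r in source_branch_revs if r <= tip}
    return eligible <= merged

# Hypothetical revisions on a source branch.
branch_revs = [100, 104, 109, 115]
full = is_full_merge(branch_revs, {100}, {104, 109})    # everything up to r109
partial = is_full_merge(branch_revs, {100}, {109})      # r104 was skipped
print(full, partial)
```

In a modularized repository, `source_branch_revs` would hold only the revisions that touch the module's paths, which is why more merges can qualify as real ones there.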

In practice, merges are recorded in Boost's SVN in three ways at various
points in time, sometimes in combination, probably including "the
empty combination":

1. In log comments
2. In the svnmerge property created by
http://www.orcaware.com/svn/wiki/Svnmerge.py
3. In the svn:mergeinfo property created by svn 1.5+
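
For the third case, Subversion 1.5+ stores merge tracking in the svn:mergeinfo property, one `path:ranges` entry per line. This is a minimal sketch of a parser for that value, assuming well-formed input; the function name and sample data are hypothetical.

```python
def parse_mergeinfo(value):
    """Parse an svn:mergeinfo property value into {source_path: set(revisions)}."""
    result = {}
    for line in value.splitlines():
        line = line.strip()
        if not line:
            continue
        path, _, ranges = line.rpartition(":")
        revs = set()
        for item in ranges.split(","):
            item = item.strip().rstrip("*")  # '*' marks a non-inheritable range
            if not item:
                continue
            start, _, end = item.partition("-")
            revs.update(range(int(start), int(end or start) + 1))
        result[path] = revs
    return result

info = parse_mergeinfo("/trunk:100-102,105\n/branches/release:90*")
print(info)
```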

Possible approaches:

1. Don't worry about it; don't attempt to recreate merges
2. Attempt to extract the information and:
   a. Find the highest SVN revision involved in any apparent merge and
      make its Git commit a parent
   b. If necessary, attempt to merge a rewritten history of the source
      branch that contains only the cherrypicked commits.
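
Approach 2a can be sketched as follows. This is a minimal illustration, assuming a mapping from SVN revision numbers to Git commit ids is available from the conversion; the function name, the mapping, and the commit ids are hypothetical.

```python
def merge_parent(merged_revs, git_commit_for_rev):
    """Approach 2a: return the Git commit of the highest merged SVN revision,
    to be used as the extra parent, or None when no merged revision is known."""
    known = [r for r in merged_revs if r in git_commit_for_rev]
    if not known:
        return None
    return git_commit_for_rev[max(known)]

# Hypothetical revision-to-commit mapping.
commits = {100: "a1b2c3d", 104: "e4f5a6b", 109: "c7d8e9f"}
parent = merge_parent({100, 104, 200}, commits)
print(parent)
```

Revisions missing from the mapping (r200 here) are ignored rather than treated as errors, which is a design choice of this sketch and may not suit every conversion.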

## Recommendations

* We should fix the bug in the KDE tool and use that. Reasons:

- The tool and the approach have been proven both technically and
socially by another project, larger than Boost
- It handles merges
- It isn't a crazy resource hog
- It allows individuals to develop conversions a submodule-at-a-time

* We should plunder subconvert and boost-modularize

They both contain useful information in their mapping files. We
should use those to generate initial versions, at least, of the input
files used by the KDE tool. We should use subconvert (or something
similar based on https://github.com/ryppl/svndumpparse) to ensure that
no commits or committers are being dropped on the floor

* With regard to merge handling, we should make a tool that, on a
module-by-module basis, uses approach 2b above to give modularizers a
guide to commits worthy of inspection in the resulting Git repo.
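
Generating initial input files for the KDE tool from existing mapping data could look roughly like this. It is a minimal sketch: the rule layout (`create repository`, `match`, `repository`, `branch`, `end match`) follows the svn2git rules-file format as commonly documented and should be checked against the tool itself; the module name and SVN paths are hypothetical and are not taken from modules.txt or develop.txt.

```python
def make_rules(module, svn_paths, branch="master"):
    """Build svn2git-style rules text sending each SVN path to one repository."""
    lines = [f"create repository {module}", "end repository", ""]
    for path in svn_paths:
        lines += [
            f"match {path}",
            f"  repository {module}",
            f"  branch {branch}",
            "end match",
            "",
        ]
    return "\n".join(lines)

# Hypothetical module and paths, for illustration only.
rules_text = make_rules(
    "filesystem",
    ["/trunk/libs/filesystem/", "/trunk/boost/filesystem/"],
)
print(rules_text)
```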

Your thoughts welcomed.

Incidentally, I'm running out of time to work on this project before I
start at Apple in February. No pressure, though ;-)

--
Dave Abrahams
BoostPro Computing Software Development Training
http://www.boostpro.com Clang/LLVM/EDG Compilers C++ Boost

Beman Dawes

Jan 15, 2013, 4:19:06 PM

to rypp...@googlegroups.com, Daniel Pfeifer, John Wiegley

On Tue, Jan 15, 2013 at 12:30 PM, Dave Abrahams <da...@boostpro.com> wrote:
>
> We've made a great deal of progress toward producing the modularized
> history that people seem to want, but we need to sort out the last
> steps.

Excellent! Thanks for all the hard work!

> KDE did this process by using
> http://techbase.kde.org/Projects/MoveToGit/UsingSvn2Git. To use that
> tool, a modularizer creates a "ruleset" that describes how to extract
> one particular module. We've done some experiments with that and it
> seems to work, but it has segfaulted on Boost's early history (somewhere
> in the first 100 commits). Might be reasonably easy to fix. Its
> biggest other weakness is that it doesn't guarantee that every commit
> and file is accounted for. If no ruleset matches a particular commit,
> it will never end up in Git.
>
> Here follows the current status of /my/ recent work.
>

>...

> Possible approaches:
>
> 1. don't worry about it; don't attempt to recreate merges

I have a hard time knowing if this is important to Boost libraries.
The cases in Boost.Filesystem where it might have mattered would have
been the merging of the version 2 and version 3 branches. But I could
not get SVN to do real merges in complex cases like those, so just did
the kind of merge that creates a diff between two branches and then
applies it to one of them (i.e. trunk). Same for merges from trunk to
release. So svn history across those merges isn't so good anyhow. I
don't expect the Git conversion to see those merges as anything more
than the application of a big diff.

> a. Find the highest SVN revision involved in any apparent merge and
> make its Git commit a parent
> b. if necessary, attempt to merge a rewritten history of the source
> branch that contains only the cherrypicked commits.
>
> ## Recommendations
>
> * We should fix the bug in the KDE tool and use that. Reasons:
>
> - The tool and the approach have been proven both technically and
> socially by another project, larger than Boost
> - It handles merges
> - It isn't a crazy resource hog
> - It allows individuals to develop conversions a submodule-at-a-time

How do you see submodule-at-a-time conversion being used?

> * We should plunder subconvert and boost-modularize
>
> They both contain useful information in their mapping files. We
> should use those to generate initial versions, at least, of the input
> files used by the KDE tool. We should use subconvert (or something
> similar based on https://github.com/ryppl/svndumpparse) to ensure that
> no commits or committers are being dropped on the floor
>
> * With regard to merge handling, we should make a tool that, on a
> module-by-module basis, uses approach 2b above to give modularizers a
> guide to commits worthy of inspection in the resulting Git repo.
>
> Your thoughts welcomed.

Stepping back to look at the big picture, are you suggesting that we
go ahead with the http://github.com/boost-lib conversion as scheduled,
and then use the KDE tool as recommended above to graft on the
history?

> Incidentally, I'm running out of time to work on this project before I
> start at Apple in February. No pressure, though ;-)

Did I miss something:-?

--Beman

Dave Abrahams

Jan 15, 2013, 4:36:06 PM

to rypp...@googlegroups.com, Daniel Pfeifer, John Wiegley

on Tue Jan 15 2013, Beman Dawes <bdawes-AT-acm.org> wrote:

> On Tue, Jan 15, 2013 at 12:30 PM, Dave Abrahams <da...@boostpro.com> wrote:
>>
>> KDE did this process by using
>> http://techbase.kde.org/Projects/MoveToGit/UsingSvn2Git. To use that
>> tool, a modularizer creates a "ruleset" that describes how to extract
>> one particular module. We've done some experiments with that and it
>> seems to work, but it has segfaulted on Boost's early history (somewhere
>> in the first 100 commits). Might be reasonably easy to fix. Its
>> biggest other weakness is that it doesn't guarantee that every commit
>> and file is accounted for. If no ruleset matches a particular commit,
>> it will never end up in Git.
>>
>> Here follows the current status of /my/ recent work.

Incidentally, the final push of submodules ran into a snag: Github
doesn't like sudden changes to its "current branch." I think we may
actually have to automate the deletion/re-creation of the repository
:-(. Or push somewhere else (e.g. Bitbucket).

>> Possible approaches:
>>
>> 1. don't worry about it; don't attempt to recreate merges
>
> I have a hard time knowing if this is important to Boost libraries.
> The cases in Boost.Filesystem where it might have mattered would have
> been the merging of the version 2 and version 3 branches. But I could
> not get SVN to do real merges in complex cases like those, so just did
> the kind of merge that creates a diff between two branches and then
> applies it to one of them (i.e. trunk). Same for merges from trunk to
> release. So svn history across those merges isn't so good anyhow. I
> don't expect the Git conversion to see those merges as anything more
> than the application of a big diff.

I'm afraid others won't be so forgiving as you are about missed merges,
but I could be mistaken.

>> a. Find the highest SVN revision involved in any apparent merge and
>> make its Git commit a parent
>> b. if necessary, attempt to merge a rewritten history of the source
>> branch that contains only the cherrypicked commits.
>>
>> ## Recommendations
>>
>> * We should fix the bug in the KDE tool and use that. Reasons:
>>
>> - The tool and the approach have been proven both technically and
>> socially by another project, larger than Boost
>> - It handles merges
>> - It isn't a crazy resource hog
>> - It allows individuals to develop conversions a submodule-at-a-time
>
> How do you see submodule-at-a-time conversion being used?

Sort of the way KDE did it. We'd continuously automate it, and let
people tweak the rulesets until they were happy with the results. I
just imagine that coming up with some initial rules could be beneficial.

>> * We should plunder subconvert and boost-modularize
>>
>> They both contain useful information in their mapping files. We
>> should use those to generate initial versions, at least, of the input
>> files used by the KDE tool. We should use subconvert (or something
>> similar based on https://github.com/ryppl/svndumpparse) to ensure that
>> no commits or committers are being dropped on the floor
>>
>> * With regard to merge handling, we should make a tool that, on a
>> module-by-module basis, uses approach 2b above to give modularizers a
>> guide to commits worthy of inspection in the resulting Git repo.
>>
>> Your thoughts welcomed.
>
> Stepping back to look at the big picture, are you suggesting that we
> go ahead with the http://github.com/boost-lib conversion as scheduled,
> and then use the KDE tool as recommended above to graft on the
> history?
>
>> Incidentally, I'm running out of time to work on this project before I
>> start at Apple in February. No pressure, though ;-)
>
> Did I miss something:-?

Guess you must've :-)

Dave Abrahams

Jan 17, 2013, 12:20:53 PM

to rypp...@googlegroups.com, Daniel Pfeifer, John Wiegley, Beman Dawes

on Tue Jan 15 2013, Dave Abrahams <dave-AT-boostpro.com> wrote:

> on Tue Jan 15 2013, Beman Dawes <bdawes-AT-acm.org> wrote:
>
>> On Tue, Jan 15, 2013 at 12:30 PM, Dave Abrahams <da...@boostpro.com> wrote:
>>>
>>> KDE did this process by using
>>> http://techbase.kde.org/Projects/MoveToGit/UsingSvn2Git. To use that
>>> tool, a modularizer creates a "ruleset" that describes how to extract
>>> one particular module. We've done some experiments with that and it
>>> seems to work, but it has segfaulted on Boost's early history (somewhere
>>> in the first 100 commits). Might be reasonably easy to fix. Its
>>> biggest other weakness is that it doesn't guarantee that every commit
>>> and file is accounted for. If no ruleset matches a particular commit,
>>> it will never end up in Git.
>>>
>>> Here follows the current status of /my/ recent work.
>
> Incidentally, the final push of submodules ran into a snag: Github
> doesn't like sudden changes to its "current branch." I think we may
> actually have to automate the deletion/re-creation of the repository
> :-(. Or push somewhere else (e.g. Bitbucket).

Turns out that it's not producing any refs in the submodule, and I don't
know how to fix that. Anyway, I don't see a reason to keep trying to
make this code work if svn2git works. The most valuable part of this
work is in the branches.txt and authors.txt mapping files.

>>> Possible approaches:
>>>
>>> 1. don't worry about it; don't attempt to recreate merges
>>
>> I have a hard time knowing if this is important to Boost libraries.
>> The cases in Boost.Filesystem where it might have mattered would have
>> been the merging of the version 2 and version 3 branches. But I could
>> not get SVN to do real merges in complex cases like those, so just did
>> the kind of merge that creates a diff between two branches and then
>> applies it to one of them (i.e. trunk). Same for merges from trunk to
>> release. So svn history across those merges isn't so good anyhow. I
>> don't expect the Git conversion to see those merges as anything more
>> than the application of a big diff.
>
> I'm afraid others won't be so forgiving as you are about missed merges,
> but I could be mistaken.

The nice thing about the KDE approach is that each maintainer can tweak
the merge information for his project until he is satisfied.

>>> a. Find the highest SVN revision involved in any apparent merge and
>>> make its Git commit a parent
>>> b. if necessary, attempt to merge a rewritten history of the source
>>> branch that contains only the cherrypicked commits.
>>>
>>> ## Recommendations
>>>
>>> * We should fix the bug in the KDE tool and use that. Reasons:

Actually on closer inspection I don't think we have to fix it. The tool
is choking on revisions 1-5 from Boost's SVN, which are *completely
empty*. Revision 0 only contains the following, which can be safely
dumped on the floor. There's no code in here; just metadata that
wouldn't have a representation in Git anyhow.

--8<---------------cut here---------------start------------->8---
UUID cdea0ec9-6010-46c6-90a0-cdc77dcfc492
{ revision: 0
( Revision-number: 0)
( Prop-content-length: 221)
( Content-length: 221)
[ set-revision-property svn:sync-last-merged-rev=82504 ]
[ set-revision-property svn:date=2007-04-30T19:23:55.745280Z ]
[ set-revision-property svn:sync-from-url=http://svn.boost.org/svn/boost ]
[ set-revision-property svn:sync-from-uuid=b8fc166d-592f-0410-95f2-cb63ce0dd405 ]
}
--8<---------------cut here---------------end--------------->8---

So we just have to pass "--resume-from 6" when invoking svn2git and
we should be OK.
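
A wrapper that builds such an invocation might look like the sketch below. Only "--resume-from 6" comes from this thread; the executable name `svn-all-fast-export`, the `--rules` and `--identity-map` options, and every file path are assumptions to be verified against the installed tool. The command is only assembled and printed, not run.

```python
import shlex

def svn2git_command(svn_repo, rules_file, identity_map, resume_from=6):
    """Assemble an svn2git argument list that skips the empty early revisions."""
    return [
        "svn-all-fast-export",          # assumed executable name
        "--identity-map", identity_map,
        "--rules", rules_file,
        "--resume-from", str(resume_from),
        svn_repo,
    ]

# Hypothetical paths, for illustration only.
cmd = svn2git_command("/srv/svn/boost", "boost.rules", "authors.map")
print(shlex.join(cmd))
```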

>>> - The tool and the approach have been proven both technically and
>>> socially by another project, larger than Boost
>>> - It handles merges
>>> - It isn't a crazy resource hog
>>> - It allows individuals to develop conversions a submodule-at-a-time
>>
>> How do you see submodule-at-a-time conversion being used?
>
> Sort of the way KDE did it. We'd continuously automate it, and let
> people tweak the rulesets until they were happy with the results. I
> just imagine that coming up with some initial rules could be beneficial.
>
>>> * We should plunder subconvert and boost-modularize
>>>
>>> They both contain useful information in their mapping files. We
>>> should use those to generate initial versions, at least, of the input
>>> files used by the KDE tool. We should use subconvert (or something
>>> similar based on https://github.com/ryppl/svndumpparse) to ensure that
>>> no commits or committers are being dropped on the floor
>>>
>>> * With regard to merge handling, we should make a tool that, on a
>>> module-by-module basis, uses approach 2b above to give modularizers a
>>> guide to commits worthy of inspection in the resulting Git repo.
>>>
>>> Your thoughts welcomed.
>>
>> Stepping back to look at the big picture, are you suggesting that we
>> go ahead with the http://github.com/boost-lib conversion as scheduled,
>> and then use the KDE tool as recommended above to graft on the
>> history?

Sorry, somehow I let this question go by without answering it. No, I'm not
suggesting any grafts. I am suggesting we go ahead by generating
initial svn2git rulesets and the resulting modularized history for
people to inspect, and give people a window during which to make any
modifications to their rulesets, during which the modularizations will
be continuously updated.

Dave Abrahams

Jan 17, 2013, 3:54:19 PM

to Beman Dawes, rypp...@googlegroups.com, Daniel Pfeifer, John Wiegley

on Thu Jan 17 2013, Beman Dawes <bdawes-AT-acm.org> wrote:

> On Thu, Jan 17, 2013 at 12:20 PM, Dave Abrahams <da...@boostpro.com> wrote:
>>
>> on Tue Jan 15 2013, Dave Abrahams <dave-AT-boostpro.com> wrote:
>>
>>> on Tue Jan 15 2013, Beman Dawes <bdawes-AT-acm.org> wrote:

> ...

>>>>> Here follows the current status of /my/ recent work.
>>>
>>> Incidentally, the final push of submodules ran into a snag: Github
>>> doesn't like sudden changes to its "current branch." I think we may
>>> actually have to automate the deletion/re-creation of the repository
>>> :-(. Or push somewhere else (e.g. Bitbucket).
>>
>> Turns out that it's not producing any refs in the submodule, and I don't
>> know how to fix that. Anyway, I don't see a reason to keep trying to
>> make this code work if svn2git works. The most valuable part of this
>> work is in the branches.txt and authors.txt mapping files.
>

> Is it a big deal to translate those files into something svn2git understands?

No, I don't think so. At least, now that John has explained his formats
to me I think I can do it fairly easily, and I'm going to give it a
shot.
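
The author half of that translation could be sketched as below. The output follows the `login = Full Name <email>` identity-map convention that svn2git is generally documented to accept; the input is assumed to be already parsed into (login, name, email) tuples, because the actual layout of authors.txt is not described in this thread. All names and data are hypothetical.

```python
def to_identity_map(rows):
    """Render (svn_login, full_name, email) tuples as identity-map lines."""
    lines = [f"{login} = {name} <{email}>" for login, name, email in sorted(rows)]
    return "\n".join(lines) + "\n"

# Hypothetical committers, for illustration only.
rows = [
    ("bob", "Bob Example", "bob@example.com"),
    ("alice", "Alice Example", "alice@example.com"),
]
identity_map = to_identity_map(rows)
print(identity_map, end="")
```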

> ...
>>>>> Your thoughts welcomed.

>>>>
>>>> Stepping back to look at the big picture, are you suggesting that we
>>>> go ahead with the http://github.com/boost-lib conversion as scheduled,
>>>> and then use the KDE tool as recommended above to graft on the
>>>> history?
>>
>> Sorry, somehow I let this question go by without answering it. No, I'm not
>> suggesting any grafts. I am suggesting we go ahead by generating
>> initial svn2git rulesets and the resulting modularized history for
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> people to inspect, and give people a window during which to make any
   ^^^^^^^^^^^^^^^^^
>> modifications to their rulesets, during which the modularizations
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> will be continuously updated.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Is this something that can be tested with a reasonable amount of
> effort? It will be hard for developers to evaluate the approach until
> they have actual repos with history to look at.

I don't see how one could interpret what I wrote above in any other way.
I'd set up a Jenkins job to take care of this and they'd be able to see
the results in repositories at http://github.com/boostorg... or possibly
at http://bitbucket.org/boostorg, since the history browsers there are
so much better.

Beman Dawes

Jan 17, 2013, 3:47:33 PM

to Dave Abrahams, rypp...@googlegroups.com, Daniel Pfeifer, John Wiegley

On Thu, Jan 17, 2013 at 12:20 PM, Dave Abrahams <da...@boostpro.com> wrote:
>
> on Tue Jan 15 2013, Dave Abrahams <dave-AT-boostpro.com> wrote:
>
>> on Tue Jan 15 2013, Beman Dawes <bdawes-AT-acm.org> wrote:

...

>>>> Here follows the current status of /my/ recent work.
>>
>> Incidentally, the final push of submodules ran into a snag: Github
>> doesn't like sudden changes to its "current branch." I think we may
>> actually have to automate the deletion/re-creation of the repository
>> :-(. Or push somewhere else (e.g. Bitbucket).
>
> Turns out that it's not producing any refs in the submodule, and I don't
> know how to fix that. Anyway, I don't see a reason to keep trying to
> make this code work if svn2git works. The most valuable part of this
> work is in the branches.txt and authors.txt mapping files.

Is it a big deal to translate those files into something svn2git understands?

...

>>>> Your thoughts welcomed.

>>>
>>> Stepping back to look at the big picture, are you suggesting that we
>>> go ahead with the http://github.com/boost-lib conversion as scheduled,
>>> and then use the KDE tool as recommended above to graft on the
>>> history?
>
> Sorry, somehow I let this question go by without answering it. No, I'm not
> suggesting any grafts. I am suggesting we go ahead by generating
> initial svn2git rulesets and the resulting modularized history for
> people to inspect, and give people a window during which to make any
> modifications to their rulesets, during which the modularizations will
> be continuously updated.

Is this something that can be tested with a reasonable amount of
effort? It will be hard for developers to evaluate the approach until
they have actual repos with history to look at.

Thanks,

--Beman
