Some background on reposurgeon and my Go translation problem


Eric Raymond

Aug 30, 2018, 11:43:41 PM
to golang-nuts
There's been enough interest here in the technical questions I've been raising recently that a bit of a backgrounder seems in order.

Back in 2010 I noticed that git's fast-import stream format opened up some possibilities its designers probably hadn't anticipated. Their original motivation was to make it easy to write exporters from other version-control systems.  Their offer was this: here's a flat-file format that can serialize the entire state of a git repository.  If you can dump your repository state from random version control system X in this format, we can reanimate the history in git.
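For concreteness, here is what a minimal stream serializing a single one-file commit looks like (the file name, author, and content are invented for illustration; see git-fast-import(1) for the exact grammar — the `data` counts are exact byte lengths of the payload that follows):

```
blob
mark :1
data 6
hello

commit refs/heads/master
mark :2
author Example Hacker <hacker@example.com> 1535700000 +0000
committer Example Hacker <hacker@example.com> 1535700000 +0000
data 13
first commit
M 100644 :1 README
```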

This was a very clever idea, and lots of people proceeded to write exporters on this model.  A few VCS implementers, noticing that this implied a vast one-way traffic of user attention away from them and towards git, wrote importers that could consume fast-export streams, carrying histories from git back to random version-control system X.  I noticed that the effect was to turn git stream dumps into a de-facto exchange standard for version-control histories.

One of my quirks is that I like thinking about version-control systems and the tools around them.  I had been noticing for years that most repository conversion tools are pretty bad in a specific way.  They tend to try to over-automate the process, producing brute-force conversions full of crufty artifacts and minor defects around the places where the data models of source and target systems don't quite match.  There were, at the time, no tools fit for a human to fix these problems.

Reposurgeon implements - in Python - a domain-specific language for describing surgical operations on repositories.  It can be run in batch mode or as an interactive structure editor for doing forensics on damaged repository metadata.  It works by calling a front end to get a stream dump of the repository you want to edit, deserializing the dump into an attributed graph, supporting a full repertoire of operations on that graph, and then writing the result out as a stream dump fed to an importer for the target system.
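A batch session with that DSL might look something like this (a rough sketch only - command names follow reposurgeon's documentation, but the repository names and the selection expression are invented):

```
# parse a Subversion dump into the internal event graph
read <project.svn
# examine a single event, addressed by its mark
:42 inspect
# pick git as the output flavor and serialize the edited graph back out
prefer git
write >project.fi
```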

The target system can be the same as the source.  Or a different one.  There are front-end/back-end pairs to support RCS, CVS, Subversion, git, mercurial, darcs, monotone, bitkeeper, bzr, and src. Not all of these combinations are well-tested, but moves from CVS, Subversion, git, bzr, and mercurial to git or mercurial are pretty solid.

My conjecture that a human-driven tool with good exploratory capabilities would produce higher-quality history translations than fully-automated converters rapidly proved correct, rather dramatically so in fact. Reposurgeon has been the key tool for a great many high-complexity, high-risk history translations.  Probably the most consequential single success was moving the history of GNU Emacs from bzr to git, cleaning up a lot of ancient cruft from RCS and CVS along the way.

Below the size of GCC, which IIRC was around 40K commits, Python gave me reasonable turnaround times. This is important if you need to do a lot of exploration to find a good conversion recipe, which you always do with these large old histories.  But there was an adverse-selection effect.  The average size of the histories people wanted to convert kept increasing. Eventually I ended up designing a semi-custom PC optimized for the job load, with high CPU-to-memory bandwidth and beefy primary caches to enable it to handle large working sets - graph-theory problems gigabytes wide.  Its fans call it the Great Beast, and three years after it was built you still can't spec a machine that performs better from COTS parts.

(That may change soon. The guy who actually put together the Beast for me is contemplating an upgrade based on the Cascade Lake chipset due out from Intel in Q4.  His plan is to build one for me and another for Linus Torvalds. The clever fellow has sponsors lined up a block long to be in the build video.)

Then came GCC.  The GCC repository is over 259K commits.  It brings the Great Beast to its knees. Minimum 9-hour times for test conversions, which is intolerable. I concluded that Python just doesn't cut it at this scale.  I then shopped for a new language pretty carefully before choosing Go.  Compiled Lisp was a contender, and I even briefly considered OCaml.  Go won in part because the semantic distance from Python is not all that large, considering the whole late-binding issue.

Python reposurgeon is about 14 KLOC.  In six days I've translated about 11% of it, building unit tests as I go. (There's already a very strong set of functional and end-to-end tests. There has to be; I wouldn't dare modify it otherwise. To say it's become algorithmically dense enough to be tricky to change would be to wallow in understatement...)

However, this is only my second Go project.  My first was a finger exercise, a re-implementation of David A. Wheeler's sloccount tool.  So I'm in a weird place where I'm translating rather advanced Python into rather advanced Go (successfully, as far as I can tell before the end-to-end tests) but I still have newbie questions.

I translated an auxiliary tool called repocutter (it slices and dices Subversion dump streams) as a warm-up.  This leads me to believe I can expect about a 40x speedup.  That's worth playing for, even before I do things like exploiting concurrency for faster searches.




Jan Mercl

Aug 31, 2018, 4:30:08 AM
to Eric Raymond, golang-nuts
On Fri, Aug 31, 2018 at 5:43 AM Eric Raymond <e...@thyrsus.com> wrote:

> I translated an auxiliary tool called repocutter (it slices and dices Subversion dump streams) as a warm-up. This leads me to believe I can expect about a 40x speedup.
> That's worth playing for, even before I do things like exploiting concurrency for faster searches.

In case you haven't heard of it before, Google was thinking along the same lines and released Grumpy last year: https://opensource.googleblog.com/2017/01/grumpy-go-running-python.html. I never used the tool, and it may not even support your large Python program(s). But maybe it could be useful one way or another.

--

-j

Jan Mercl

Aug 31, 2018, 4:37:08 AM
to Eric Raymond, golang-nuts
On Fri, Aug 31, 2018 at 10:29 AM Jan Mercl <0xj...@gmail.com> wrote:

https://opensource.googleblog.com/2017/01/grumpy-go-running-python.html

The blog post linked above links to http://grump.io/, but it seems to be a dead link for me ATM. The project repository is here: https://github.com/google/grumpy.

--

-j

Sebastien Binet

Aug 31, 2018, 4:43:31 AM
to Jan Mercl, e...@thyrsus.com, golang-nuts
note that this repo is... dormant.
the new home for grumpy is now:

-s

Eric Raymond

Aug 31, 2018, 5:48:38 AM
to golang-nuts


On Friday, August 31, 2018 at 4:30:08 AM UTC-4, Jan Mercl wrote:
In case you haven't heard of it before, Google was thinking along the same lines and released Grumpy last year: https://opensource.googleblog.com/2017/01/grumpy-go-running-python.html. I never used the tool, and it may not even support your large Python program(s). But maybe it could be useful one way or another.

 I did experiment with that.  Unfortunately, the source code it produced when I ran it on my stuff was an unmaintainable mess. Possibly correct, I don't know - but I couldn't trust it because I can't read it. 

Jan Mercl

Aug 31, 2018, 6:22:41 AM
to Eric Raymond, golang-nuts
On Fri, Aug 31, 2018 at 11:48 AM Eric Raymond <e...@thyrsus.com> wrote:

> I did experiment with that. Unfortunately, the source code it produced when I ran it on my stuff was an unmaintainable mess.
> Possibly correct, I don't know - but I couldn't trust it because I can't read it. 
 
Yup, not surprising, considering that source-to-source compilers do not target human readers. But at least the unreadable code, if it works, can give some idea of the speedup one can expect from the manual translation.

--

-j
