[PATCH 0 of 9] Performance enhancements for exporting to Git

Gregory Szorc

Sep 23, 2012, 8:03:29 PM
to hg-...@googlegroups.com
This series of patches is mostly centered around making exporting from
Mercurial to Git faster.

The initial patches are very small. I'm just pre-compiling a bunch of
regular expressions. This is considered a best practice and will save
some CPU resources, although you probably won't notice it in the wall
time of conversions.
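
As a rough illustration of the pattern (a sketch in modern Python, not the
hg-git code itself; parse_author is a hypothetical helper), hoisting the
compile to module level looks like this:

```python
import re

# Compiled once at import time instead of on every call.
RE_GIT_AUTHOR = re.compile(r'^(.*?) ?<(.*?)(?:>(.*))?$')


def parse_author(author):
    """Hypothetical helper: split 'Name <email>' into (name, email).

    Returns None when the string has no '<', i.e. no email part.
    """
    m = RE_GIT_AUTHOR.match(author)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

Python's re module already caches recently compiled patterns internally,
which is part of why the wall-time gain is small. What the module-level
constant avoids is mostly the per-call cache lookup.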

I also snuck a patch in there that verifies the tree and parent commits
exist in the Git repository before saving a new Git commit object. This
adds minimal overhead and catches a pretty obvious case of repository
corruption.
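
The check itself is simple. A minimal sketch, assuming only that the object
store supports the `in` operator (a plain set stands in for it here;
MissingObjectError and check_commit_references are hypothetical names, not
hg-git or dulwich API):

```python
class MissingObjectError(Exception):
    """Raised when a commit would reference an object the store lacks."""


def check_commit_references(object_store, tree_sha, parent_shas):
    """Raise MissingObjectError if the tree or any parent is absent.

    object_store only needs to support membership tests.
    """
    for sha in parent_shas:
        if sha not in object_store:
            raise MissingObjectError(
                'Parent SHA-1 not present in Git repo: %s' % sha)
    if tree_sha not in object_store:
        raise MissingObjectError(
            'Tree SHA-1 not present in Git repo: %s' % tree_sha)
```

Calling this right before the commit object is written means a broken
reference aborts the export instead of being saved.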

The big patch is at the end. I changed how Git trees are exported. The
new code is well-documented, so I won't describe it that much here. Just
know that there is a slight behavior change: blob IDs are no longer
saved to the ID mapping. Instead, the first time TreeTracker is fired
up, it will export a blob it hasn't seen before, possibly redundantly with
something that's in the Git repo already. This shouldn't matter: Git
will happily figure things out on the next pack.
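
To make the behavior change concrete, here is a small sketch of an
in-memory blob cache in modern Python. BlobCache and git_blob_id are
hypothetical names used for illustration, and the blob ID is computed by
hand with hashlib rather than through dulwich:

```python
import hashlib


def git_blob_id(data):
    """Return the Git object ID of a blob holding the given bytes."""
    header = b'blob %d\0' % len(data)
    return hashlib.sha1(header + data).hexdigest()


class BlobCache:
    """Remembers blob IDs per file node for the lifetime of one run."""

    def __init__(self):
        self._ids = {}

    def lookup(self, filenode, read_data):
        """Return (blob_id, data_to_export).

        data_to_export is None on a cache hit. On a miss the file is read
        via read_data() and returned so the caller can export it, even if
        an identical blob is already in the Git repository.
        """
        blob_id = self._ids.get(filenode)
        if blob_id is not None:
            return blob_id, None
        data = read_data()
        blob_id = git_blob_id(data)
        self._ids[filenode] = blob_id
        return blob_id, data
```

Because the cache is not persisted, a fresh instance re-exports blobs it has
not seen in that run, which is the redundancy described above.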

The new code passes the test suite. And, conversion of the actual
Mercurial repository yields identical commit hashes with the patches
applied. The only difference is it runs about 3x faster. Conversion of
mozilla-central also runs about 3x faster with this patch.

My next series of patches will center around doing tree export in
parallel. This should make things scale up to the number of cores in
your machine. The end of this patch series is a good stopping point
before I make this transition.

In my initial series of patches mailed to this list, I added versioning
of hg-git state. I'll probably re-submit these patches at some point. It
/might/ be a good idea to commit them before this patch series since I
deprecated storing blobs in the mapping file.

The patches in this series should apply cleanly on top of the next
branch.

Gregory Szorc

Sep 23, 2012, 8:03:30 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348280926 25200
# Node ID 5ca256907196d908a23dc72bf3230e975565d6e8
# Parent e152bdf5998098e135d15affd2a98c357d323b3f
Optimize get_git_author

Pre-compile regular expression. Prevent extra key lookup in author_map.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -24,16 +24,18 @@
 from mercurial.node import hex, bin, nullid
 from mercurial import context, util as hgutil
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

+RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -428,22 +430,20 @@
         """
         return re.sub('[<>\n]', '?', name.lstrip('< ').rstrip('> '))

     def get_git_author(self, ctx):
         # hg authors might not have emails
         author = ctx.user()

         # see if a translation exists
-        if author in self.author_map:
-            author = self.author_map[author]
+        author = self.author_map.get(author, author)

         # check for git author pattern compliance
-        regex = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')
-        a = regex.match(author)
+        a = RE_GIT_AUTHOR.match(author)

         if a:
             name = self.get_valid_git_username_email(a.group(1))
             email = self.get_valid_git_username_email(a.group(2))
             if a.group(3) != None and len(a.group(3)) != 0:
                 name += ' ext:(' + urllib.quote(a.group(3)) + ')'
             author = self.get_valid_git_username_email(name) + ' <' + self.get_valid_git_username_email(email) + '>'
         elif '@' in author:


Gregory Szorc

Sep 23, 2012, 8:03:31 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348281136 25200
# Node ID 58a14700666b1eb51d13bee03ce411da149104a7
# Parent 5ca256907196d908a23dc72bf3230e975565d6e8
Precompile Git URI regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -26,16 +26,23 @@
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

+# Test for git:// and git+ssh:// URI.
+# Support several URL forms, including separating the
+# host and path with either a / or : (sepr)
+RE_GIT_URI = re.compile(
+    r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
+    r'(?P<sepr>[:/])(?P<path>.*)$')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -1288,24 +1295,17 @@
         except UnicodeDecodeError:
             return string.decode('ascii', 'replace').encode('utf-8')

     def get_transport_and_path(self, uri):
         # pass hg's ui.ssh config to dulwich
         if not issubclass(client.get_ssh_vendor, _ssh.SSHVendor):
             client.get_ssh_vendor = _ssh.generate_ssh_vendor(self.ui)

-        # Test for git:// and git+ssh:// URI.
-        # Support several URL forms, including separating the
-        # host and path with either a / or : (sepr)
-        git_pattern = re.compile(
-            r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
-            r'(?P<sepr>[:/])(?P<path>.*)$'
-        )
-        git_match = git_pattern.match(uri)
+        git_match = RE_GIT_URI.match(uri)
         if git_match:
             res = git_match.groupdict()
             transport = client.SSHGitClient if 'ssh' in res['scheme'] else client.TCPGitClient
             host, port, sepr, path = res['host'], res['port'], res['sepr'], res['path']
             if sepr == '/':
                 path = '/' + path
             # strip trailing slash for heroku-style URLs
             # ssh+git://g...@heroku.com:project.git/

Gregory Szorc

Sep 23, 2012, 8:03:32 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348281417 25200
# Node ID 1de6cd07221e0d5fb90b30e515891a325574a9fc
# Parent 58a14700666b1eb51d13bee03ce411da149104a7
Precompile Git username sanitizing regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -26,16 +26,18 @@
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

+RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')
+
 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 class GitProgress(object):
@@ -430,17 +432,17 @@
         'jo...@doe.com'
         >>> g(' <jo...@doe.com> ')
         'jo...@doe.com'
         >>> g(' <random<\n<garbage\n> > > ')
         'random???garbage?'
         >>> g('Typo in hgrc >but.h...@handles.it.gracefully>')
         'Typo in hgrc ?but.h...@handles.it.gracefully'
         """
-        return re.sub('[<>\n]', '?', name.lstrip('< ').rstrip('> '))
+        return RE_GIT_SANITIZE_AUTHOR.sub('?', name.lstrip('< ').rstrip('> '))

     def get_git_author(self, ctx):
         # hg authors might not have emails
         author = ctx.user()

         # see if a translation exists
         author = self.author_map.get(author, author)

Gregory Szorc

Sep 23, 2012, 8:03:33 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348281593 25200
# Node ID fbb9ade686ffdd66ed30a7e9b68d9246409f7389
# Parent 1de6cd07221e0d5fb90b30e515891a325574a9fc
Precompile Git author extra data regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -28,16 +28,18 @@
 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

 RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')

+RE_GIT_AUTHOR_EXTRA = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
+
 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 class GitProgress(object):
@@ -702,18 +704,17 @@
         text = '\n'.join([l.rstrip() for l in text.splitlines()]).strip('\n')
         if text + '\n' != origtext:
             extra['message'] = create_delta(text +'\n', origtext)

         author = commit.author

         # convert extra data back to the end
         if ' ext:' in commit.author:
-            regex = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
-            m = regex.match(commit.author)
+            m = RE_GIT_AUTHOR_EXTRA.match(commit.author)
             if m:
                 name = m.group(1)
                 ex = urllib.unquote(m.group(2))
                 email = m.group(3)
                 author = name + ' <' + email + '>' + ex

         if ' <none@none>' in commit.author:
             author = commit.author[:-12]

Gregory Szorc

Sep 23, 2012, 8:03:34 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348281744 25200
# Node ID df4f8b7f800ff2a4c8a0c8e30575f38d5ea45fcb
# Parent fbb9ade686ffdd66ed30a7e9b68d9246409f7389
Precompile Git progress regular expressions

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -37,39 +37,42 @@

 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

+RE_NEWLINES = re.compile('[\r\n]')
+RE_GIT_PROGRESS = re.compile('\((\d+)/(\d+)\)')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

     def progress(self, msg):
         # 'Counting objects: 33640, done.\n'
         # 'Compressing objects: 0% (1/9955) \r
-        msgs = re.split('[\r\n]', self.msgbuf + msg)
+        msgs = RE_NEWLINES.split(self.msgbuf + msg)
         self.msgbuf = msgs.pop()

         for msg in msgs:
             td = msg.split(':', 1)
             data = td.pop()
             if not td:
                 self.flush(data)
                 continue
             topic = td[0]

-            m = re.search('\((\d+)/(\d+)\)', data)
+            m = RE_GIT_PROGRESS.search(data)
             if m:
                 if self.lasttopic and self.lasttopic != topic:
                     self.flush()
                 self.lasttopic = topic

                 pos, total = map(int, m.group(1, 2))
                 util.progress(self.ui, topic, pos, total=total)
             else:

Gregory Szorc

Sep 23, 2012, 8:03:35 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348281830 25200
# Node ID 95b937230a1352d738c81bbe1e3b3a031e071956
# Parent df4f8b7f800ff2a4c8a0c8e30575f38d5ea45fcb
Precompile author file regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -40,16 +40,18 @@
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 RE_NEWLINES = re.compile('[\r\n]')
 RE_GIT_PROGRESS = re.compile('\((\d+)/(\d+)\)')

+RE_AUTHOR_FILE = re.compile('\s*=\s*')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -120,17 +122,17 @@
         if self.ui.config('git', 'authors'):
             with open(self.repo.wjoin(
                 self.ui.config('git', 'authors')
             )) as f:
                 for line in f:
                     line = line.strip()
                     if not line or line.startswith('#'):
                         continue
-                    from_, to = re.split(r'\s*=\s*', line, 2)
+                    from_, to = RE_AUTHOR_FILE.split(line, 2)
                     self.author_map[from_] = to

     ## FILE LOAD AND SAVE METHODS

     def map_set(self, gitsha, hgsha):
         self._map_git[gitsha] = hgsha
         self._map_hg[hgsha] = gitsha

Gregory Szorc

Sep 23, 2012, 8:03:36 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348284386 25200
# Node ID 2db03c124dde9c84de1006526f497d867094a231
# Parent 95b937230a1352d738c81bbe1e3b3a031e071956
Verify tree and parent objects are in Git repo

When exporting Git commits, verify that the tree and parents objects
exist in the repository before allowing the commit to be exported. If a
tree or parent commit is missing, then the repository is not valid and
the export should not be allowed.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -379,24 +379,32 @@
         commit.commit_time = commit.author_time
         commit.commit_timezone = commit.author_timezone

         commit.parents = []
         for parent in self.get_git_parents(ctx):
             hgsha = hex(parent.node())
             git_sha = self.map_git_get(hgsha)
             if git_sha:
+                if git_sha not in self.git.object_store:
+                    raise hgutil.Abort(_('Parent SHA-1 not present in Git '
+                                         'repo: %s' % git_sha))
+
                 commit.parents.append(git_sha)

         commit.message = self.get_git_message(ctx)

         if 'encoding' in extra:
             commit.encoding = extra['encoding']

         tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+        if tree_sha not in self.git.object_store:
+            raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
+                                 tree_sha))
+
         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())

         self.swap_out_encoding(oldenc)
         return commit.id

Gregory Szorc

Sep 23, 2012, 8:03:37 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348284630 25200
# Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
# Parent 2db03c124dde9c84de1006526f497d867094a231
Make get_valid_git_username_email static

Also alias where it is used to make code a little easier to read.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -403,17 +403,18 @@
         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())

         self.swap_out_encoding(oldenc)
         return commit.id

-    def get_valid_git_username_email(self, name):
+    @staticmethod
+    def get_valid_git_username_email(name):
         r"""Sanitize usernames and emails to fit git's restrictions.

         The following is taken from the man page of git's fast-import
         command:

             [...] Likewise LF means one (and only one) linefeed [...]

             committer
@@ -435,17 +436,17 @@
         angle brackets and spaces from the beginning, and right angle
         brackets and spaces from the end, of this string, to convert
         such things as " <jo...@doe.com> " to "jo...@doe.com" for
         convenience.

         TESTS:

         >>> from mercurial.ui import ui
-        >>> g = GitHandler('', ui()).get_valid_git_username_email
+        >>> g = GitHandler.get_valid_git_username_email
         >>> g('John Doe')
         'John Doe'
         >>> g('jo...@doe.com')
         'jo...@doe.com'
         >>> g(' <jo...@doe.com> ')
         'jo...@doe.com'
         >>> g(' <random<\n<garbage\n> > > ')
         'random???garbage?'
@@ -459,26 +460,28 @@
         author = ctx.user()

         # see if a translation exists
         author = self.author_map.get(author, author)

         # check for git author pattern compliance
         a = RE_GIT_AUTHOR.match(author)

+        get_valid = GitHandler.get_valid_git_username_email
+
         if a:
-            name = self.get_valid_git_username_email(a.group(1))
-            email = self.get_valid_git_username_email(a.group(2))
+            name = get_valid(a.group(1))
+            email = get_valid(a.group(2))
             if a.group(3) != None and len(a.group(3)) != 0:
                 name += ' ext:(' + urllib.quote(a.group(3)) + ')'
-            author = self.get_valid_git_username_email(name) + ' <' + self.get_valid_git_username_email(email) + '>'
+            author = get_valid(name) + ' <' + get_valid(email) + '>'
         elif '@' in author:
-            author = self.get_valid_git_username_email(author) + ' <' + self.get_valid_git_username_email(author) + '>'
+            author = get_valid(author) + ' <' + get_valid(author) + '>'
         else:
-            author = self.get_valid_git_username_email(author) + ' <none@none>'
+            author = get_valid(author) + ' <none@none>'

         if 'author' in ctx.extra():
             author = "".join(apply_delta(author, ctx.extra()['author']))

         return author

     def get_git_parents(self, ctx):
         def is_octopus_part(ctx):

Gregory Szorc

Sep 23, 2012, 8:03:38 PM
to hg-...@googlegroups.com
# HG changeset patch
# User Gregory Szorc <gregor...@gmail.com>
# Date 1348422117 25200
# Node ID 85c4b8e2e129975f400c9810eb9bf6ce6fea4c8b
# Parent ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
Implement TreeTracker for incremental tree calculation

This class makes exporting Mercurial changesets to Git much faster.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -1,13 +1,12 @@
 import os, math, urllib, re
 import stat, posixpath, StringIO

 from dulwich.errors import HangupException, GitProtocolError, UpdateRefsError
-from dulwich.index import commit_tree
 from dulwich.objects import Blob, Commit, Tag, Tree, parse_timezone, S_IFGITLINK
 from dulwich.pack import create_delta, apply_delta
 from dulwich.repo import Repo
 from dulwich import client
 from dulwich import config as dul_config

 try:
     from mercurial import bookmarks
@@ -24,16 +23,18 @@
 from mercurial.node import hex, bin, nullid
 from mercurial import context, util as hgutil
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

+from .hg2git import TreeTracker
+
 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

 RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')

 RE_GIT_AUTHOR_EXTRA = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')

 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
@@ -323,32 +324,35 @@
     def export_git_objects(self):
         self.init_if_missing()

         nodes = [self.repo.lookup(n) for n in self.repo]
         export = [node for node in nodes if not hex(node) in self._map_hg]
         total = len(export)
         if total:
             self.ui.status(_("exporting hg objects to git\n"))
+
+        tracker = TreeTracker(self.repo)
+
         for i, rev in enumerate(export):
             util.progress(self.ui, 'exporting', i, total=total)
             ctx = self.repo.changectx(rev)
             state = ctx.extra().get('hg-git', None)
             if state == 'octopus':
                 self.ui.debug("revision %d is a part "
                               "of octopus explosion\n" % ctx.rev())
                 continue
-            self.export_hg_commit(rev)
+            self.export_hg_commit(rev, tracker)
         util.progress(self.ui, 'importing', None, total=total)


     # convert this commit into git objects
     # go through the manifest, convert all blobs/trees we don't have
     # write the commit object (with metadata info)
-    def export_hg_commit(self, rev):
+    def export_hg_commit(self, rev, tracker):
         self.ui.note(_("converting revision %s\n") % hex(rev))

         oldenc = self.swap_out_encoding()

         ctx = self.repo.changectx(rev)
         extra = ctx.extra()

         commit = Commit()
@@ -390,17 +394,21 @@

                 commit.parents.append(git_sha)

         commit.message = self.get_git_message(ctx)

         if 'encoding' in extra:
             commit.encoding = extra['encoding']

-        tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+        for obj in tracker.update_changeset(ctx):
+            self.git.object_store.add_object(obj)
+
+        tree_sha = tracker.root_tree_sha
+
         if tree_sha not in self.git.object_store:
             raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
                                  tree_sha))

         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())
@@ -536,53 +544,16 @@
                 add_extras = True
                 extra_message += "extra : " + key + " : " + urllib.quote(value) + "\n"

         if add_extras:
             message += "\n--HG--\n" + extra_message

         return message

-    def iterblobs(self, ctx):
-        if '.hgsubstate' in ctx:
-            hgsub = util.OrderedDict()
-            if '.hgsub' in ctx:
-                hgsub = util.parse_hgsub(ctx['.hgsub'].data().splitlines())
-            hgsubstate = util.parse_hgsubstate(ctx['.hgsubstate'].data().splitlines())
-            for path, sha in hgsubstate.iteritems():
-                try:
-                    if path in hgsub and not hgsub[path].startswith('[git]'):
-                        # some other kind of a repository (e.g. [hg])
-                        # that keeps its state in .hgsubstate, shall ignore
-                        continue
-                    yield path, sha, S_IFGITLINK
-                except ValueError:
-                    pass
-
-        for f in ctx:
-            if f == '.hgsubstate' or f == '.hgsub':
-                continue
-            fctx = ctx[f]
-            blobid = self.map_git_get(hex(fctx.filenode()))
-
-            if not blobid:
-                blob = Blob.from_string(fctx.data())
-                self.git.object_store.add_object(blob)
-                self.map_set(blob.id, hex(fctx.filenode()))
-                blobid = blob.id
-
-            if 'l' in ctx.flags(f):
-                mode = 0120000
-            elif 'x' in ctx.flags(f):
-                mode = 0100755
-            else:
-                mode = 0100644
-
-            yield f, blobid, mode
-
     def getnewgitcommits(self, refs=None):
         self.init_if_missing()

         # import heads and fetched tags as remote references
         todo = []
         done = set()
         convert_list = {}

diff --git a/hggit/hg2git.py b/hggit/hg2git.py
new file mode 100644
--- /dev/null
+++ b/hggit/hg2git.py
@@ -0,0 +1,205 @@
+# This file contains code dealing specifically with converting Mercurial
+# repositories to Git repositories. Code in this file is meant to be a generic
+# library and should be usable outside the context of hg-git or an hg command.
+
+import os
+import stat
+
+from dulwich.objects import Blob
+from dulwich.objects import S_IFGITLINK
+from dulwich.objects import TreeEntry
+from dulwich.objects import Tree
+
+from mercurial import error as hgerror
+from mercurial.node import nullrev
+
+from . import util
+
+class TreeTracker(object):
+    """Tracks Git tree objects across Mercurial revisions.
+
+    The purpose of this class is to facilitate Git tree export that is more
+    optimal than brute force. The tree calculation part of this class is
+    essentially a reimplementation of dulwich.index.commit_tree. However, since
+    our implementation reuses Tree instances and only recalculates SHA-1 when
+    things change, we are much more efficient.
+
+    Callers instantiate this class against a mercurial.localrepo instance. They
+    then associate the tracker with a specific changeset by calling
+    update_changeset(). That function emits Git objects that need to be
+    exported to a Git repository. Callers then typically obtain the
+    root_tree_sha and use that as part of assembling a Git commit.
+    """
+
+    def __init__(self, hg_repo):
+        self._hg = hg_repo
+        self._rev = nullrev
+        self._dirs = {}
+        self._blob_cache = {}
+
+    @property
+    def root_tree_sha(self):
+        return self._dirs[''].id
+
+    def update_changeset(self, ctx):
+        """Set the tree to track a new Mercurial changeset.
+
+        This is a generator of dulwich Git objects. Each returned object can be
+        added to a Git store via add_object(). Some objects may already exist
+        in the Git repository. Emitted objects are either Blob or Tree
+        instances.
+
+        Emitted objects are those that have changed since the last call to
+        update_changeset.
+        """
+        # In theory we should be able to look at changectx.files(). This is
+        # *much* faster. However, it may not be accurate, especially with older
+        # repositories, which may not record things like deleted files
+        # explicitly in the manifest (which is where files() gets its data).
+        # The only reliable way to get the full set of changes is by looking at
+        # the full manifest. And, the easy way to compare two manifests is
+        # localrepo.status().
+
+        # The other members of status are only relevant when looking at the
+        # working directory.
+        modified, added, removed = self._hg.status(self._rev, ctx.rev())[0:3]
+
+        for path in sorted(removed, key=len, reverse=True):
+            d = os.path.dirname(path)
+            tree = self._dirs.get(d, Tree())
+
+            del tree[os.path.basename(path)]
+
+            if not len(tree):
+                self._remove_tree(d)
+                continue
+
+            self._dirs[d] = tree
+
+        for path in sorted(set(modified) | set(added), key=len, reverse=True):
+            if path == '.hgsubstate':
+                self._handle_subrepos(ctx)
+                continue
+
+            if path == '.hgsub':
+                continue
+
+            d = os.path.dirname(path)
+            tree = self._dirs.get(d, Tree())
+
+            fctx = ctx[path]
+
+            entry, blob = TreeTracker.tree_entry(fctx, self._blob_cache)
+            if blob is not None:
+                yield blob
+
+            tree.add(*entry)
+            self._dirs[d] = tree
+
+        for obj in self._populate_tree_entries():
+            yield obj
+
+        self._rev = ctx.rev()
+
+    def _remove_tree(self, path):
+        try:
+            del self._dirs[path]
+        except KeyError:
+            return
+
+        # Now we traverse up to the parent and delete any references.
+        if path == '':
+            return
+
+        basename = os.path.basename(path)
+        parent = os.path.dirname(path)
+        while True:
+            tree = self._dirs.get(parent, None)
+
+            # No parent entry. Nothing to remove or update.
+            if tree is None:
+                return
+
+            try:
+                del tree[basename]
+            except KeyError:
+                return
+
+            if len(tree):
+                return
+
+            # The parent tree is empty. So, we can delete it.
+            del self._dirs[parent]
+
+            if parent == '':
+                return
+
+            basename = os.path.basename(parent)
+            parent = os.path.dirname(parent)
+
+    def _populate_tree_entries(self):
+        if '' not in self._dirs:
+            self._dirs[''] = Tree()
+
+        # Fill in missing directories.
+        for path in self._dirs.keys():
+            parent = os.path.dirname(path)
+
+            while parent != '':
+                parent_tree = self._dirs.get(parent, None)
+
+                if parent_tree is not None:
+                    break
+
+                self._dirs[parent] = Tree()
+                parent = os.path.dirname(parent)
+
+        # TODO only emit trees that have been modified.
+        for d in sorted(self._dirs.keys(), key=len, reverse=True):
+            tree = self._dirs[d]
+            yield tree
+
+            if d == '':
+                continue
+
+            parent_tree = self._dirs[os.path.dirname(d)]
+            parent_tree[os.path.basename(d)] = (stat.S_IFDIR, tree.id)
+
+    def _handle_subrepos(self, ctx):
+        substate = util.parse_hgsubstate(ctx['.hgsubstate'].data().splitlines())
+        sub = util.OrderedDict()
+
+        if '.hgsub' in ctx:
+            sub = util.parse_hgsub(ctx['.hgsub'].data().splitlines())
+
+        for path, sha in substate.iteritems():
+            # Ignore non-Git repositories keeping state in .hgsubstate.
+            if path in sub and not sub[path].startswith('[git]'):
+                continue
+
+            d = os.path.dirname(path)
+            tree = self._dirs.get(d, Tree())
+            tree.add(os.path.basename(path), S_IFGITLINK, sha)
+            self._dirs[d] = tree
+
+    @staticmethod
+    def tree_entry(fctx, blob_cache):
+        blob_id = blob_cache.get(fctx.filenode(), None)
+        blob = None
+
+        if blob_id is None:
+            blob = Blob.from_string(fctx.data())
+            blob_id = blob.id
+            blob_cache[fctx.filenode()] = blob_id
+
+        flags = fctx.flags()
+
+        if 'l' in flags:
+            mode = 0120000
+        elif 'x' in flags:
+            mode = 0100755
+        else:
+            mode = 0100644
+
+        return (TreeEntry(os.path.basename(fctx.path()), mode, blob_id), blob)
+
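
To make the idea behind TreeTracker easier to follow, here is a toy model in
modern Python. It is only a sketch under simplifying assumptions:
IncrementalTrees and its methods are hypothetical names, and the digests are
plain SHA-1 hashes over sorted entries rather than real Git tree objects. It
also goes one step further than the patch, whose TODO notes that every
tracked tree is still emitted: here only directories touched since the last
call are rehashed.

```python
import hashlib
import posixpath


class IncrementalTrees:
    """Toy model: directory digests are recomputed only for dirty paths."""

    def __init__(self):
        self._entries = {'': {}}   # directory path -> {name: digest}
        self._dirty = {''}         # directories needing a new digest
        self._root = None
        self.recomputed = []       # directories hashed by the last call

    def set_file(self, path, blob_digest):
        """Record a file's digest and mark its directory chain dirty."""
        d, name = posixpath.split(path)
        self._entries.setdefault(d, {})[name] = blob_digest
        while True:
            self._entries.setdefault(d, {})
            self._dirty.add(d)
            if d == '':
                break
            d = posixpath.dirname(d)

    def _digest(self, d):
        h = hashlib.sha1()
        for name, digest in sorted(self._entries[d].items()):
            h.update(('%s\0%s\n' % (name, digest)).encode('utf-8'))
        return h.hexdigest()

    def root_digest(self):
        """Rehash dirty directories, longest path first, and return the root."""
        self.recomputed = []
        # A child path is always longer than its parent, so sorting by
        # length (descending) guarantees children are hashed first.
        for d in sorted(self._dirty, key=len, reverse=True):
            digest = self._digest(d)
            self.recomputed.append(d)
            if d == '':
                self._root = digest
            else:
                parent, name = posixpath.split(d)
                self._entries[parent][name] = digest
        self._dirty.clear()
        return self._root
```

Changing one file under c/ after an initial export rehashes only c and the
root; the a/ subtree keeps its previous digest.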

David M. Carr

Sep 24, 2012, 9:02:11 AM
to hg-...@googlegroups.com
Thanks for submitting these patches. Better performance exporting to
Git is definitely something I've been wishing for. I'll try to take a
closer look at them tonight.

--
David M. Carr
davi...@gmail.com

Augie Fackler

Sep 24, 2012, 9:56:42 AM
to hg-...@googlegroups.com

On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregor...@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregor...@gmail.com>
> # Date 1348284630 25200
> # Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
> # Parent 2db03c124dde9c84de1006526f497d867094a231
> Make get_valid_git_username_email static

Rather than making this a static method, can we make it a module-level free function? Is there a reason not to?

Augie Fackler

Sep 24, 2012, 9:58:18 AM
to hg-...@googlegroups.com
queued 1-7, dropped last two.

In the future, I'd appreciate commit messages along the lines of hg's style:

component: something succinct

Free form text here.

eg

git_handler: optimize get_git_author


Thanks!

Gregory Szorc

Sep 24, 2012, 12:33:25 PM
to hg-...@googlegroups.com, Augie Fackler
On 9/24/2012 6:58 AM, Augie Fackler wrote:
> queued 1-7, dropped last two.
>
> In the future, I'd appreciate commit messages along the lines of hg's style:
>
> component: something succinct
>
> Free form text here.
>
> eg
>
> git_handler: optimize get_git_author

Will do (this is my first set of patches to hg-git or Mercurial).

Gregory Szorc

Sep 24, 2012, 12:34:58 PM
to hg-...@googlegroups.com, Augie Fackler
On 9/24/2012 6:56 AM, Augie Fackler wrote:
> On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregor...@gmail.com> wrote:
>
>> # HG changeset patch
>> # User Gregory Szorc <gregor...@gmail.com>
>> # Date 1348284630 25200
>> # Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
>> # Parent 2db03c124dde9c84de1006526f497d867094a231
>> Make get_valid_git_username_email static
> Rather than making this a static method, can we make it a module-level free function? Is there a reason not to?

If that's the style you prefer, I see no reason why it can't be a
module-level function.
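
For comparison, a sketch of the two styles in modern Python. The sanitizing
logic is the one from the patch; the Handler class and its
sanitize_via_static method are hypothetical stand-ins for illustration:

```python
import re

RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')


def get_valid_git_username_email(name):
    """Module-level form: callable without referencing any class."""
    return RE_GIT_SANITIZE_AUTHOR.sub('?', name.lstrip('< ').rstrip('> '))


class Handler(object):
    """Hypothetical class showing the static-method form."""

    @staticmethod
    def sanitize_via_static(name):
        # Same behavior, but callers must go through the class name.
        return RE_GIT_SANITIZE_AUTHOR.sub(
            '?', name.lstrip('< ').rstrip('> '))
```

Both forms behave identically; the module-level function just drops the
class qualifier at every call site.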

Gregory Szorc

Sep 24, 2012, 12:39:14 PM
to hg-...@googlegroups.com
Having had a night to sleep on it, I don't like dropping files/blobs from the map file. Not yet, anyway. Let me refactor this a little bit to add them back in. This should be a pretty minor and non-invasive change, so I don't think you'll waste time looking at the other code before I get around to creating a new version of the patch.

David M. Carr

Sep 26, 2012, 3:30:19 AM
to hg-...@googlegroups.com
I've now had a chance to look at these further. I can confirm that
with the patches applied, all-version-tests still passes for me.
Likewise, I can confirm that patch 9 appears to provide a nice
performance boost. I attempted to measure the performance impact of
patches 1-8, but wasn't able to measure any significant differences.
From a code review perspective, I don't have anything to add to what
Augie has already said.
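
For anyone who wants to repeat that kind of measurement in isolation, a
micro-benchmark along these lines is one option. This is a sketch in modern
Python, not what was actually run; the function names and the sample string
are illustrative:

```python
import re
import timeit

PATTERN = r'\((\d+)/(\d+)\)'
RE_GIT_PROGRESS = re.compile(PATTERN)
SAMPLE = 'Compressing objects: 0% (1/9955)'


def search_inline(data):
    """Pass the pattern string on every call, as before the patches."""
    return re.search(PATTERN, data)


def search_precompiled(data):
    """Use the module-level compiled pattern, as after the patches."""
    return RE_GIT_PROGRESS.search(data)


def compare(number=100000):
    """Return (inline_seconds, precompiled_seconds) for `number` calls each."""
    inline = timeit.timeit(lambda: search_inline(SAMPLE), number=number)
    pre = timeit.timeit(lambda: search_precompiled(SAMPLE), number=number)
    return inline, pre
```

Any difference this shows is per call, so it can easily disappear in the
total time of a real conversion.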

David M. Carr

Nov 2, 2012, 4:39:05 PM
to hg-...@googlegroups.com, gregor...@gmail.com
Any luck at creating a new version of this patch?

--
David M. Carr
da...@carrclan.us

Max Horn

Nov 13, 2012, 5:49:03 AM
to hg-...@googlegroups.com, David M. Carr, gregor...@gmail.com

On 02.11.2012, at 21:39, David M. Carr wrote:

> On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregor...@gmail.com> wrote:
>> Having a night to sleep on it, I don't like dropping files/blobs from the
>> map file. Not yet, anyway. Let me refactor this a little bit to add them
>> back in. This should be a pretty minor and non-invasive change. So, I don't
>> think you'll waste time looking at the other code before I get around to
>> creating a new version of the patch.

[...]

>
> Any luck at creating a new version of this patch?

Yeah, I would love to know about that, too. A set of patches making hg-git faster is of course something many people would love to see :). It seems Gregory did some work on his patches for a few days after the above email (looking at <https://github.com/indygreg/hg-git/commits/performance-master>), and I would just love to know what the state is, and if one can help... Anyway, I hope Gregory will find some time to work on cleaning this up and re-submitting :).


Cheers,
Max

Gregory Szorc

Nov 18, 2012, 6:13:48 PM
to David M. Carr, hg-...@googlegroups.com
On 11/2/2012 1:39 PM, David M. Carr wrote:
> On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregor...@gmail.com> wrote:
>> Having a night to sleep on it, I don't like dropping files/blobs from the
>> map file. Not yet, anyway. Let me refactor this a little bit to add them
>> back in. This should be a pretty minor and non-invasive change. So, I don't
>> think you'll waste time looking at the other code before I get around to
>> creating a new version of the patch.
>
> Any luck at creating a new version of this patch?

I believe so, yes. I need to find some time to clean it up and resubmit
it for consideration. I've been extremely busy with other projects. The
latest version of the code is living at
https://github.com/indygreg/hg-git/tree/performance-next for anyone who
is interested. It probably needs to be rebased against the latest tree.