This series of patches is mostly centered around making exporting from
Mercurial to Git faster.
The initial patches are very small. I'm just pre-compiling a bunch of
regular expressions. This is considered a best practice and will save
some CPU resources, although you probably won't notice a difference in
the wall time of conversions.
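As a rough illustration of the pre-compilation pattern these patches apply, here is a minimal sketch. The pattern mirrors the progress regex from the series; the helper name parse_progress is hypothetical and not part of hg-git. Python's re module already caches recently used patterns internally, which fits with the small wall-time difference.

```python
import re

# Compiled once at import time instead of on every call.
# The pattern mirrors RE_GIT_PROGRESS from the patch series.
RE_GIT_PROGRESS = re.compile(r'\((\d+)/(\d+)\)')


def parse_progress(data):
    """Return (current, total) parsed from text such as '(3/10)', or None.

    parse_progress is a hypothetical helper used only for illustration.
    """
    m = RE_GIT_PROGRESS.search(data)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

The module-level constant also gives the pattern a name, which makes call sites easier to read.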
I also snuck a patch in there that verifies the tree and parent commits
exist in the Git repository before saving a new Git commit object. This
adds minimal overhead and catches a pretty obvious case of repository
corruption.
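The check described above could look roughly like the sketch below. It assumes only that the object store supports the `in` operator, as self.git.object_store does in the patch; the exception class and function name are hypothetical stand-ins (the patch itself raises hgutil.Abort).

```python
class MissingObjectError(Exception):
    """Hypothetical error: a commit would reference an object the store lacks."""


def check_commit_references(object_store, tree_sha, parent_shas):
    """Raise MissingObjectError unless the tree and every parent are present.

    object_store only needs to support membership tests with `in`.
    """
    if tree_sha not in object_store:
        raise MissingObjectError(
            'Tree SHA-1 not present in Git repo: %s' % tree_sha)
    for sha in parent_shas:
        if sha not in object_store:
            raise MissingObjectError(
                'Parent SHA-1 not present in Git repo: %s' % sha)
```

A plain set of SHA strings is enough to exercise it, e.g. check_commit_references({'t1', 'p1'}, 't1', ['p1']).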
The big patch is at the end. I changed how Git trees are exported. The
new code is well-documented, so I won't describe it that much here. Just
know that there is a slight behavior change: blob IDs are no longer
saved to the ID mapping. Instead, the first time TreeTracker is fired up,
it will export a blob it hasn't seen before, possibly redundantly with
something that's in the Git repo already. This shouldn't matter: Git
will happily figure things out on the next pack.
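To make the incremental idea concrete, here is a toy sketch that is independent of dulwich and Mercurial. It is not the TreeTracker from the patch: the class name, the hash_count counter, and the hashing format are all assumptions for illustration (the format is a simplified stand-in, not Git's real tree encoding), and removals are not handled. It only shows why reusing per-directory state avoids re-hashing untouched directories.

```python
import hashlib
import posixpath


class ToyTreeTracker(object):
    """Toy model: re-hash only directories touched since the last call."""

    def __init__(self):
        self._dirs = {'': {}}   # directory path -> {entry name: hex id}
        self._ids = {}          # directory path -> cached hex id
        self._dirty = set([''])
        self.hash_count = 0     # number of directory hashes computed so far

    def set_file(self, path, content):
        """Add or replace a file (content is bytes); mark ancestors dirty."""
        d, name = posixpath.split(path)
        self._dirs.setdefault(d, {})[name] = hashlib.sha1(content).hexdigest()
        # Walk up to the root, linking each directory into its parent.
        while True:
            self._dirty.add(d)
            if d == '':
                break
            parent, base = posixpath.split(d)
            self._dirs.setdefault(parent, {}).setdefault(base, None)
            d = parent

    def root_id(self):
        """Re-hash dirty directories, deepest first, and return the root id."""
        def depth(p):
            return len(p.split('/')) if p else 0

        for d in sorted(self._dirty, key=depth, reverse=True):
            entries = self._dirs[d]
            for name in list(entries):
                child = posixpath.join(d, name) if d else name
                if child in self._dirs:
                    # Subdirectory: reuse its cached (or just computed) id.
                    entries[name] = self._ids[child]
            # Simplified serialization, NOT Git's tree object format.
            data = '\n'.join('%s %s' % (n, entries[n]) for n in sorted(entries))
            self._ids[d] = hashlib.sha1(data.encode('utf-8')).hexdigest()
            self.hash_count += 1
        self._dirty.clear()
        return self._ids['']
```

After an initial root_id() call, changing one file under a/ re-hashes only a and the root; sibling directories keep their cached ids.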
The new code passes the test suite. And, conversion of the actual
Mercurial repository yields identical commit hashes with the patches
applied. The only difference is it runs about 3x faster. Conversion of
mozilla-central also runs about 3x faster with this patch.
My next series of patches will center around doing tree export in
parallel. This should make things scale up to the number of cores in
your machine. The end of this patch series is a good stopping point
before I make this transition.
In my initial series of patches mailed to this list, I added versioning
of hg-git state. I'll probably re-submit these patches at some point. It
/might/ be a good idea to commit them before this patch series since I
deprecated storing blobs in the mapping file.
The patches in this series should apply cleanly on top of the next
branch.
+# Test for git:// and git+ssh:// URI.
+# Support several URL forms, including separating the
+# host and path with either a / or : (sepr)
+RE_GIT_URI = re.compile(
+ r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
+ r'(?P<sepr>[:/])(?P<path>.*)$')
+
class GitProgress(object):
"""convert git server progress strings into mercurial progress"""
def __init__(self, ui):
self.ui = ui
def get_transport_and_path(self, uri):
# pass hg's ui.ssh config to dulwich
if not issubclass(client.get_ssh_vendor, _ssh.SSHVendor):
client.get_ssh_vendor = _ssh.generate_ssh_vendor(self.ui)
- # Test for git:// and git+ssh:// URI.
- # Support several URL forms, including separating the
- # host and path with either a / or : (sepr)
- git_pattern = re.compile(
- r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
- r'(?P<sepr>[:/])(?P<path>.*)$'
- )
- git_match = git_pattern.match(uri)
+ git_match = RE_GIT_URI.match(uri)
if git_match:
res = git_match.groupdict()
transport = client.SSHGitClient if 'ssh' in res['scheme'] else client.TCPGitClient
host, port, sepr, path = res['host'], res['port'], res['sepr'], res['path']
if sepr == '/':
path = '/' + path
# strip trailing slash for heroku-style URLs
# ssh+git://...@heroku.com:project.git/
+RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')
+
# Test for git:// and git+ssh:// URI.
# Support several URL forms, including separating the
# host and path with either a / or : (sepr)
RE_GIT_URI = re.compile(
r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
r'(?P<sepr>[:/])(?P<path>.*)$')
+RE_GIT_AUTHOR_EXTRA = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
+
# Test for git:// and git+ssh:// URI.
# Support several URL forms, including separating the
# host and path with either a / or : (sepr)
RE_GIT_URI = re.compile(
r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
r'(?P<sepr>[:/])(?P<path>.*)$')
class GitProgress(object):
@@ -702,18 +704,17 @@
text = '\n'.join([l.rstrip() for l in text.splitlines()]).strip('\n')
if text + '\n' != origtext:
extra['message'] = create_delta(text +'\n', origtext)
author = commit.author
# convert extra data back to the end
if ' ext:' in commit.author:
- regex = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
- m = regex.match(commit.author)
+ m = RE_GIT_AUTHOR_EXTRA.match(commit.author)
if m:
name = m.group(1)
ex = urllib.unquote(m.group(2))
email = m.group(3)
author = name + ' <' + email + '>' + ex
if ' <none@none>' in commit.author:
author = commit.author[:-12]
# Test for git:// and git+ssh:// URI.
# Support several URL forms, including separating the
# host and path with either a / or : (sepr)
RE_GIT_URI = re.compile(
r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
r'(?P<sepr>[:/])(?P<path>.*)$')
+RE_NEWLINES = re.compile('[\r\n]')
+RE_GIT_PROGRESS = re.compile('\((\d+)/(\d+)\)')
+
class GitProgress(object):
"""convert git server progress strings into mercurial progress"""
def __init__(self, ui):
self.ui = ui
for msg in msgs:
td = msg.split(':', 1)
data = td.pop()
if not td:
self.flush(data)
continue
topic = td[0]
- m = re.search('\((\d+)/(\d+)\)', data)
+ m = RE_GIT_PROGRESS.search(data)
if m:
if self.lasttopic and self.lasttopic != topic:
self.flush()
self.lasttopic = topic
+RE_AUTHOR_FILE = re.compile('\s*=\s*')
+
class GitProgress(object):
"""convert git server progress strings into mercurial progress"""
def __init__(self, ui):
self.ui = ui
self.lasttopic = None
self.msgbuf = ''
@@ -120,17 +122,17 @@
if self.ui.config('git', 'authors'):
with open(self.repo.wjoin(
self.ui.config('git', 'authors')
)) as f:
for line in f:
line = line.strip()
if not line or line.startswith('#'):
continue
- from_, to = re.split(r'\s*=\s*', line, 2)
+ from_, to = RE_AUTHOR_FILE.split(line, 2)
self.author_map[from_] = to
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348284386 25200
# Node ID 2db03c124dde9c84de1006526f497d867094a231
# Parent 95b937230a1352d738c81bbe1e3b3a031e071956
Verify tree and parent objects are in Git repo
When exporting Git commits, verify that the tree and parent objects
exist in the repository before allowing the commit to be exported. If a
tree or parent commit is missing, then the repository is not valid and
the export should not be allowed.
commit.parents = []
for parent in self.get_git_parents(ctx):
hgsha = hex(parent.node())
git_sha = self.map_git_get(hgsha)
if git_sha:
+ if git_sha not in self.git.object_store:
+ raise hgutil.Abort(_('Parent SHA-1 not present in Git '
+ 'repo: %s' % git_sha))
+
commit.parents.append(git_sha)
commit.message = self.get_git_message(ctx)
if 'encoding' in extra:
commit.encoding = extra['encoding']
tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+ if tree_sha not in self.git.object_store:
+ raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
+ tree_sha))
+
commit.tree = tree_sha
- def get_valid_git_username_email(self, name):
+ @staticmethod
+ def get_valid_git_username_email(name):
r"""Sanitize usernames and emails to fit git's restrictions.
The following is taken from the man page of git's fast-import
command:
[...] Likewise LF means one (and only one) linefeed [...]
committer
@@ -435,17 +436,17 @@
angle brackets and spaces from the beginning, and right angle
brackets and spaces from the end, of this string, to convert
such things as " <j...@doe.com> " to "j...@doe.com" for
convenience.
# Test for git:// and git+ssh:// URI.
# Support several URL forms, including separating the
@@ -323,32 +324,35 @@
def export_git_objects(self):
self.init_if_missing()
nodes = [self.repo.lookup(n) for n in self.repo]
export = [node for node in nodes if not hex(node) in self._map_hg]
total = len(export)
if total:
self.ui.status(_("exporting hg objects to git\n"))
+
+ tracker = TreeTracker(self.repo)
+
for i, rev in enumerate(export):
util.progress(self.ui, 'exporting', i, total=total)
ctx = self.repo.changectx(rev)
state = ctx.extra().get('hg-git', None)
if state == 'octopus':
self.ui.debug("revision %d is a part "
"of octopus explosion\n" % ctx.rev())
continue
- self.export_hg_commit(rev)
+ self.export_hg_commit(rev, tracker)
util.progress(self.ui, 'importing', None, total=total)
# convert this commit into git objects
# go through the manifest, convert all blobs/trees we don't have
# write the commit object (with metadata info)
- def export_hg_commit(self, rev):
+ def export_hg_commit(self, rev, tracker):
self.ui.note(_("converting revision %s\n") % hex(rev))
oldenc = self.swap_out_encoding()
ctx = self.repo.changectx(rev)
extra = ctx.extra()
commit = Commit()
@@ -390,17 +394,21 @@
commit.parents.append(git_sha)
commit.message = self.get_git_message(ctx)
if 'encoding' in extra:
commit.encoding = extra['encoding']
- tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+ for obj in tracker.update_changeset(ctx):
+ self.git.object_store.add_object(obj)
+
+ tree_sha = tracker.root_tree_sha
+
if tree_sha not in self.git.object_store:
raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
tree_sha))
if add_extras:
message += "\n--HG--\n" + extra_message
return message
- def iterblobs(self, ctx):
- if '.hgsubstate' in ctx:
- hgsub = util.OrderedDict()
- if '.hgsub' in ctx:
- hgsub = util.parse_hgsub(ctx['.hgsub'].data().splitlines())
- hgsubstate = util.parse_hgsubstate(ctx['.hgsubstate'].data().splitlines())
- for path, sha in hgsubstate.iteritems():
- try:
- if path in hgsub and not hgsub[path].startswith('[git]'):
- # some other kind of a repository (e.g. [hg])
- # that keeps its state in .hgsubstate, shall ignore
- continue
- yield path, sha, S_IFGITLINK
- except ValueError:
- pass
-
- for f in ctx:
- if f == '.hgsubstate' or f == '.hgsub':
- continue
- fctx = ctx[f]
- blobid = self.map_git_get(hex(fctx.filenode()))
-
- if not blobid:
- blob = Blob.from_string(fctx.data())
- self.git.object_store.add_object(blob)
- self.map_set(blob.id, hex(fctx.filenode()))
- blobid = blob.id
-
- if 'l' in ctx.flags(f):
- mode = 0120000
- elif 'x' in ctx.flags(f):
- mode = 0100755
- else:
- mode = 0100644
-
- yield f, blobid, mode
-
def getnewgitcommits(self, refs=None):
self.init_if_missing()
# import heads and fetched tags as remote references
todo = []
done = set()
convert_list = {}
diff --git a/hggit/hg2git.py b/hggit/hg2git.py
new file mode 100644
--- /dev/null
+++ b/hggit/hg2git.py
@@ -0,0 +1,205 @@
+# This file contains code dealing specifically with converting Mercurial
+# repositories to Git repositories. Code in this file is meant to be a generic
+# library and should be usable outside the context of hg-git or an hg command.
+
+import os
+import stat
+
+from dulwich.objects import Blob
+from dulwich.objects import S_IFGITLINK
+from dulwich.objects import TreeEntry
+from dulwich.objects import Tree
+
+from mercurial import error as hgerror
+from mercurial.node import nullrev
+
+from . import util
+
+class TreeTracker(object):
+ """Tracks Git tree objects across Mercurial revisions.
+
+ The purpose of this class is to facilitate Git tree export that is more
+ optimal than brute force. The tree calculation part of this class is
+ essentially a reimplementation of dulwich.index.commit_tree. However, since
+ our implementation reuses Tree instances and only recalculates SHA-1 when
+ things change, we are much more efficient.
+
+ Callers instantiate this class against a mercurial.localrepo instance. They
+ then associate the tracker with a specific changeset by calling
+ update_changeset(). That function emits Git objects that need to be
+ exported to a Git repository. Callers then typically obtain the
+ root_tree_sha and use that as part of assembling a Git commit.
+ """
+
+ def __init__(self, hg_repo):
+ self._hg = hg_repo
+ self._rev = nullrev
+ self._dirs = {}
+ self._blob_cache = {}
+
+ @property
+ def root_tree_sha(self):
+ return self._dirs[''].id
+
+ def update_changeset(self, ctx):
+ """Set the tree to track a new Mercurial changeset.
+
+ This is a generator of dulwich Git objects. Each returned object can be
+ added to a Git store via add_object(). Some objects may already exist
+ in the Git repository. Emitted objects are either Blob or Tree
+ instances.
+
+ Emitted objects are those that have changed since the last call to
+ update_changeset.
+ """
+ # In theory we should be able to look at changectx.files(). This is
+ # *much* faster. However, it may not be accurate, especially with older
+ # repositories, which may not record things like deleted files
+ # explicitly in the manifest (which is where files() gets its data).
+ # The only reliable way to get the full set of changes is by looking at
+ # the full manifest. And, the easy way to compare two manifests is
+ # localrepo.status().
+
+ # The other members of status are only relevant when looking at the
+ # working directory.
+ modified, added, removed = self._hg.status(self._rev, ctx.rev())[0:3]
+
+ for path in sorted(removed, key=len, reverse=True):
+ d = os.path.dirname(path)
+ tree = self._dirs.get(d, Tree())
+
+ del tree[os.path.basename(path)]
+
+ if not len(tree):
+ self._remove_tree(d)
+ continue
+
+ self._dirs[d] = tree
+
+ for path in sorted(set(modified) | set(added), key=len, reverse=True):
+ if path == '.hgsubstate':
+ self._handle_subrepos(ctx)
+ continue
+
+ if path == '.hgsub':
+ continue
+
+ d = os.path.dirname(path)
+ tree = self._dirs.get(d, Tree())
+
+ fctx = ctx[path]
+
+ entry, blob = TreeTracker.tree_entry(fctx, self._blob_cache)
+ if blob is not None:
+ yield blob
+
+ tree.add(*entry)
+ self._dirs[d] = tree
+
+ for obj in self._populate_tree_entries():
+ yield obj
+
+ self._rev = ctx.rev()
+
+ def _remove_tree(self, path):
+ try:
+ del self._dirs[path]
+ except KeyError:
+ return
+
+ # Now we traverse up to the parent and delete any references.
+ if path == '':
+ return
+
+ basename = os.path.basename(path)
+ parent = os.path.dirname(path)
+ while True:
+ tree = self._dirs.get(parent, None)
+
+ # No parent entry. Nothing to
On Sun, Sep 23, 2012 at 8:03 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
> This series of patches is mostly centered around making exporting from
> Mercurial to Git faster.
> --
> You received this message because you are subscribed to the Google Groups "hg-git" group.
> To post to this group, send email to hg-git@googlegroups.com.
> To unsubscribe from this group, send email to hg-git+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/hg-git?hl=en.
Thanks for submitting these patches. Better performance exporting to
Git is definitely something I've been wishing for. I'll try to take a
closer look at them tonight.
> - def get_valid_git_username_email(self, name):
> + @staticmethod
> + def get_valid_git_username_email(name):
> r"""Sanitize usernames and emails to fit git's restrictions.
> The following is taken from the man page of git's fast-import
> command:
> [...] Likewise LF means one (and only one) linefeed [...]
> committer
> @@ -435,17 +436,17 @@
> angle brackets and spaces from the beginning, and right angle
> brackets and spaces from the end, of this string, to convert
> such things as " <j...@doe.com> " to "j...@doe.com" for
> convenience.
> On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.sz...@gmail.com>
>> # Date 1348284630 25200
>> # Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
>> # Parent 2db03c124dde9c84de1006526f497d867094a231
>> Make get_valid_git_username_email static
> Rather than making this a static method, can we make it a module-level free function? Is there a reason not to?
If that's the style you prefer, I see no reason why it can't be a module-level function.
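For readers weighing the two styles, here is a small sketch of the same helper written both ways. The function body is illustrative only and is not claimed to match hg-git's implementation; sanitize_author_field and Handler are hypothetical names, while RE_GIT_SANITIZE_AUTHOR reuses the pattern from earlier in the series.

```python
import re

RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')


def sanitize_author_field(name):
    """Module-level form: importable and callable without any class.

    Trims angle brackets and spaces from both ends, then replaces any
    remaining '<', '>' or newline with '?'. Illustrative body only.
    """
    return RE_GIT_SANITIZE_AUTHOR.sub('?', name.lstrip('< ').rstrip('> '))


class Handler(object):
    """Hypothetical stand-in for the class that held the method."""

    @staticmethod
    def sanitize(name):
        # Static-method form: same behavior, but reached through the class.
        return sanitize_author_field(name)
```

Both forms avoid needing an instance; the free function additionally avoids importing the class just to call a pure helper.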
Having had a night to sleep on it, I don't like dropping files/blobs from the map file. Not yet, anyway. Let me refactor this a little bit to add them back in. This should be a pretty minor and non-invasive change. So, I don't think you'll waste time looking at the other code before I get around to creating a new version of the patch.
> Thanks for submitting these patches. Better performance exporting to
> Git is definitely something I've been wishing for. I'll try to take a
> closer look at them tonight.
I've now had a chance to look at these further. I can confirm that
with the patches applied, all-version-tests still passes for me.
Likewise, I can confirm that patch 9 appears to provide a nice
performance boost. I attempted to measure the performance impact of
patches 1-8, but wasn't able to measure any significant differences.
From a code review perspective, I don't have anything to add to what
Augie has already said.
On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
> Having had a night to sleep on it, I don't like dropping files/blobs from the
> map file. Not yet, anyway. Let me refactor this a little bit to add them
> back in. This should be a pretty minor and non-invasive change. So, I don't
> think you'll waste time looking at the other code before I get around to
> creating a new version of the patch.
> On Sunday, September 23, 2012 5:04:21 PM UTC-7, Gregory Szorc wrote:
>> # HG changeset patch
>> # User Gregory Szorc <gregory.sz...@gmail.com>
>> # Date 1348422117 25200
>> # Node ID 85c4b8e2e129975f400c9810eb9bf6ce6fea4c8b
>> # Parent ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
>> Implement TreeTracker for incremental tree calculation
>> This class makes exporting Mercurial changesets to Git much faster.
>> # Test for git:// and git+ssh:// URI.
>> # Support several URL forms, including separating the
>> @@ -323,32 +324,35 @@
>> def export_git_objects(self):
>> self.init_if_missing()
>> nodes = [self.repo.lookup(n) for n in self.repo]
>> export = [node for node in nodes if not hex(node) in
>> self._map_hg]
>> total = len(export)
>> if total:
>> self.ui.status(_("exporting hg objects to git\n"))
>> +
>> + tracker = TreeTracker(self.repo)
>> +
>> for i, rev in enumerate(export):
>> util.progress(self.ui, 'exporting', i, total=total)
>> ctx = self.repo.changectx(rev)
>> state = ctx.extra().get('hg-git', None)
>> if state == 'octopus':
>> self.ui.debug("revision %d is a part "
>> "of octopus explosion\n" % ctx.rev())
>> continue
>> - self.export_hg_commit(rev)
>> + self.export_hg_commit(rev, tracker)
>> util.progress(self.ui, 'importing', None, total=total)
>> # convert this commit into git objects
>> # go through the manifest, convert all blobs/trees we don't have
>> # write the commit object (with metadata info)
>> - def export_hg_commit(self, rev):
>> + def export_hg_commit(self, rev, tracker):
>> self.ui.note(_("converting revision %s\n") % hex(rev))
>> oldenc = self.swap_out_encoding()
>> ctx = self.repo.changectx(rev)
>> extra = ctx.extra()
>> commit = Commit()
>> @@ -390,17 +394,21 @@
>> commit.parents.append(git_sha)
>> commit.message = self.get_git_message(ctx)
>> if 'encoding' in extra:
>> commit.encoding = extra['encoding']
>> - tree_sha = commit_tree(self.git.object_store,
>> self.iterblobs(ctx))
>> + for obj in tracker.update_changeset(ctx):
>> + self.git.object_store.add_object(obj)
>> +
>> + tree_sha = tracker.root_tree_sha
>> +
>> if tree_sha not in self.git.object_store:
>> raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s'
>> %
>> tree_sha))
>> if add_extras:
>> message += "\n--HG--\n" + extra_message
>> return message
>> -    def iterblobs(self, ctx):
>> -        if '.hgsubstate' in ctx:
>> -            hgsub = util.OrderedDict()
>> -            if '.hgsub' in ctx:
>> -                hgsub = util.parse_hgsub(ctx['.hgsub'].data().splitlines())
>> -            hgsubstate = util.parse_hgsubstate(ctx['.hgsubstate'].data().splitlines())
>> -            for path, sha in hgsubstate.iteritems():
>> -                try:
>> -                    if path in hgsub and not hgsub[path].startswith('[git]'):
>> -                        # some other kind of a repository (e.g. [hg])
>> -                        # that keeps its state in .hgsubstate, shall ignore
>> -                        continue
>> -                    yield path, sha, S_IFGITLINK
>> -                except ValueError:
>> -                    pass
>> -
>> -        for f in ctx:
>> -            if f == '.hgsubstate' or f == '.hgsub':
>> -                continue
>> -            fctx = ctx[f]
>> -            blobid = self.map_git_get(hex(fctx.filenode()))
>> -
>> -            if not blobid:
>> -                blob = Blob.from_string(fctx.data())
>> -                self.git.object_store.add_object(blob)
>> -                self.map_set(blob.id, hex(fctx.filenode()))
>> -                blobid = blob.id
>> -
>> -            if 'l' in ctx.flags(f):
>> -                mode = 0120000
>> -            elif 'x' in ctx.flags(f):
>> -                mode = 0100755
>> -            else:
>> -                mode = 0100644
>> -
>> -            yield f, blobid, mode
>> -
>> def getnewgitcommits(self, refs=None):
>> self.init_if_missing()
>> # import heads and fetched tags as remote references
>> todo = []
>> done = set()
>> convert_list = {}
>> diff --git a/hggit/hg2git.py b/hggit/hg2git.py
>> new file mode 100644
>> --- /dev/null
>> +++ b/hggit/hg2git.py
>> @@ -0,0 +1,205 @@
>> +# This file contains code dealing specifically with converting Mercurial
>> +# repositories to Git repositories. Code in this file is meant to be a generic
>> +# library and should be usable outside the context of hg-git or an hg command.
>> +
>> +import os
>> +import stat
>> +
>> +from dulwich.objects import Blob
>> +from dulwich.objects import S_IFGITLINK
>> +from dulwich.objects import TreeEntry
>> +from dulwich.objects import Tree
>> +
>> +from mercurial import error as hgerror
>> +from mercurial.node import nullrev
>> +
>> +from . import util
>> +
>> +class TreeTracker(object):
>> +    """Tracks Git tree objects across Mercurial revisions.
>> +
>> +    The purpose of this class is to facilitate Git tree export that is more
>> +    optimal than brute force. The tree calculation part of this class is
>> +    essentially a reimplementation of dulwich.index.commit_tree. However, since
>> +    our implementation reuses Tree instances and only recalculates SHA-1 when
>> +    things change, we are much more efficient.
>> +
>> +    Callers instantiate this class against a mercurial.localrepo instance. They
>> +    then associate the tracker with a specific changeset by calling
>> +    update_changeset(). That function emits Git objects that need to be
>> +    exported to a Git repository. Callers then typically obtain the
>> +    root_tree_sha and use that as part of assembling a Git commit.
>> +    """
>> +
>> +    def __init__(self, hg_repo):
>> +        self._hg = hg_repo
>> +        self._rev = nullrev
>> +        self._dirs = {}
>> +        self._blob_cache = {}
>> +
>> +    @property
>> +    def root_tree_sha(self):
>> +        return self._dirs[''].id
>> +
>> +    def update_changeset(self, ctx):
>> +        """Set the tree to track a new Mercurial changeset.
>> +
>> +        This is a generator of dulwich Git objects. Each returned object can be
>> +        added to a Git store via add_object(). Some objects may already exist
>> +        in the Git repository. Emitted objects are either Blob or Tree
>> +        instances.
>> +
>> +        Emitted objects are those that have changed since the last call to
>> +        update_changeset.
>> +        """
>> +        # In theory we should be able to look at changectx.files(). This is
>> +        # *much* faster. However, it may not be accurate, especially with older
>> +        # repositories, which may not record things like deleted files
>> +        # explicitly in the manifest (which is where files() gets its data).
>> +        # The only reliable way to get the full set of changes is by looking at
>> +        # the full manifest. And, the easy way to compare two manifests is
>> +        # localrepo.status().
>> +
>> +        # The other members of status are only relevant when looking at the
>> +        # working directory.
>> +        modified, added, removed = self._hg.status(self._rev, ctx.rev())[0:3]
>> +
>> +        for path in sorted(removed, key=len, reverse=True):
>> +            d = os.path.dirname(path)
>> +            tree = self._dirs.get(d, Tree())
>> +
>> +            del tree[os.path.basename(path)]
>> +
>> +            if not len(tree):
> On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
>> Having a night to sleep on it, I don't like dropping files/blobs from the
>> map file. Not yet, anyway. Let me refactor this a little bit to add them
>> back in. This should be a pretty minor and non-invasive change. So, I don't
>> think you'll waste time looking at the other code before I get around to
>> creating a new version of the patch.
[...]
> Any luck at creating a new version of this patch?
Yeah, I would love to know about that, too. A set of patches making hg-git faster is of course something many people would love to see :). It seems Gregory did some work on his patches for a few days after the above email (looking at <https://github.com/indygreg/hg-git/commits/performance-master>), and I would just love to know what the state is, and if one can help... Anyway, I hope Gregory will find some time to work on cleaning this up and re-submitting :).
>> Any luck at creating a new version of this patch?
I believe so, yes. I need to find some time to clean it up and resubmit it for consideration. I've been extremely busy with other projects. The latest version of the code is living at https://github.com/indygreg/hg-git/tree/performance-next for anyone who is interested. It probably needs to be rebased against the latest tree.
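
For anyone skimming the thread, the core idea of the quoted patch (keep per-directory tree state between changesets and recompute only the directories touched by a change, plus their ancestors) can be sketched roughly as follows. This is an illustration only, not code from the patch: IncrementalTreeHasher and its methods are hypothetical names, and it hashes plain strings with hashlib instead of building dulwich Tree objects or real Git object formats.

```python
import hashlib
import posixpath


class IncrementalTreeHasher(object):
    """Hypothetical stand-in for the idea behind TreeTracker (no hg/dulwich)."""

    def __init__(self):
        # Directory path -> {entry name: hex digest}. '' is the root directory.
        self._dirs = {'': {}}
        # Number of directory digests computed so far (for illustration).
        self.hashed = 0

    def _ensure(self, d):
        # Create empty entries for d and any missing ancestors.
        while d not in self._dirs:
            self._dirs[d] = {}
            d = posixpath.dirname(d)

    def _digest(self, d):
        # Digest of one directory: its sorted (name, digest) pairs.
        self.hashed += 1
        h = hashlib.sha1()
        for name, sha in sorted(self._dirs[d].items()):
            h.update(('%s %s\n' % (name, sha)).encode('utf-8'))
        return h.hexdigest()

    def update(self, changed, removed=()):
        """Apply one changeset and return the new root digest.

        changed maps path -> file content (bytes); removed lists deleted paths.
        """
        dirty = set()
        for path in removed:
            d, name = posixpath.split(path)
            del self._dirs[d][name]
            dirty.add(d)
        for path, data in changed.items():
            d, name = posixpath.split(path)
            self._ensure(d)
            self._dirs[d][name] = hashlib.sha1(data).hexdigest()
            dirty.add(d)

        # A changed directory also changes every ancestor up to the root.
        for d in list(dirty):
            while d:
                d = posixpath.dirname(d)
                dirty.add(d)

        def depth(d):
            return d.count('/') + 1 if d else 0

        # Re-hash the deepest directories first so parents see fresh digests.
        for d in sorted(dirty, key=depth, reverse=True):
            if not d:
                continue
            parent, name = posixpath.split(d)
            if self._dirs[d]:
                self._dirs[parent][name] = self._digest(d)
            else:
                # Drop directories that became empty.
                del self._dirs[d]
                self._dirs[parent].pop(name, None)
        return self._digest('')


tracker = IncrementalTreeHasher()
tracker.update({'src/a.py': b'1', 'src/b.py': b'2', 'docs/x.txt': b'3'})
# Only 'src' and the root need re-hashing here; 'docs' becomes empty and is dropped.
root = tracker.update({'src/a.py': b'changed'}, removed=['docs/x.txt'])
```

Untouched directories keep their cached digests, so the cost of each update scales with the size of the change and not with the size of the whole manifest.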