Performance enhancements for exporting to Git
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:29 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 0 of 9] Performance enhancements for exporting to Git
This series of patches is mostly centered around making exporting from
Mercurial to Git faster.

The initial patches are very small. I'm just pre-compiling a bunch of
regular expressions. This is considered a best practice and will save
some CPU resources, although you probably won't be able to tell from
the wall time of conversions.
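
As a rough illustration of what the pre-compilation patches change, the sketch below compares calling re.search() with a pattern string against calling .search() on a pattern compiled once at module level. The pattern is the progress expression from patch 5; the function names and the sample string are made up for the example. Python's re module keeps its own cache of compiled patterns, so the saving is mostly the per-call cache lookup, which fits the expectation that wall time barely moves.

```python
import re
import timeit

PATTERN = r'\((\d+)/(\d+)\)'

# Compiled once at import time, as the patches do at module level.
RE_GIT_PROGRESS = re.compile(PATTERN)


def search_inline(data):
    # re.search() has to look the pattern up in the re module's cache
    # on every call before it can match.
    return re.search(PATTERN, data)


def search_precompiled(data):
    # The compiled pattern object is used directly.
    return RE_GIT_PROGRESS.search(data)


def compare(sample, number=100000):
    """Return (inline_seconds, precompiled_seconds) for `number` searches."""
    inline = timeit.timeit(lambda: search_inline(sample), number=number)
    precompiled = timeit.timeit(lambda: search_precompiled(sample),
                                number=number)
    return inline, precompiled
```

Calling compare('Compressing objects:   0% (1/9955)') gives two timings to compare on a given machine; no particular ratio is claimed here.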

I also snuck a patch in there that verifies the tree and parent commits
exist in the Git repository before saving a new Git commit object. This
adds minimal overhead and catches a pretty obvious case of repository
corruption.

The big patch is at the end. I changed how Git trees are exported. The
new code is well-documented, so I won't describe it that much here. Just
know that there is a slight behavior change: blob IDs are no longer
saved to the ID mapping. Instead, the first time TreeTracker is fired
up, it will export a blob it hasn't seen before, possibly redundantly
with something that's in the Git repo already. This shouldn't matter:
Git will happily figure things out on the next pack.
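
The general idea behind incremental tree calculation can be sketched as follows. This is a much simplified illustration, not hg-git's TreeTracker and not Git's real tree format: the class name IncrementalTrees, its methods, and the hashing scheme are all assumptions made for the example. It only shows why keeping per-directory state between changesets avoids rehashing directories that did not change.

```python
import hashlib
import posixpath


class IncrementalTrees(object):
    """Hypothetical tracker: rehash only directories touched by a change."""

    def __init__(self):
        self._dirs = {'': {}}   # directory path -> {file name: blob hash}
        self._cache = {}        # directory path -> cached tree hash
        self.rehashed = 0       # number of directory hashes recomputed

    def set_file(self, path, data):
        directory, name = posixpath.split(path)
        # Create the directory and any missing ancestors.
        d = directory
        while d not in self._dirs:
            self._dirs[d] = {}
            d = posixpath.dirname(d)
        self._dirs[directory][name] = hashlib.sha1(data).hexdigest()
        self._invalidate(directory)

    def _invalidate(self, directory):
        # A change invalidates this directory and every ancestor up to
        # the root, but no sibling directories.
        while True:
            self._cache.pop(directory, None)
            if directory == '':
                break
            directory = posixpath.dirname(directory)

    def tree_hash(self, directory=''):
        cached = self._cache.get(directory)
        if cached is not None:
            return cached
        self.rehashed += 1
        entries = dict(self._dirs[directory])
        for child in self._dirs:
            if child and posixpath.dirname(child) == directory:
                name = posixpath.basename(child) + '/'
                entries[name] = self.tree_hash(child)
        digest = hashlib.sha1()
        for name in sorted(entries):
            digest.update(name.encode('utf-8') + b'\0' +
                          entries[name].encode('ascii'))
        self._cache[directory] = digest.hexdigest()
        return self._cache[directory]
```

With files in two sibling directories, changing one file recomputes only that directory and the root; the untouched sibling keeps its cached hash.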

The new code passes the test suite. And, conversion of the actual
Mercurial repository yields identical commit hashes with the patches
applied. The only difference is it runs about 3x faster. Conversion of
mozilla-central also runs about 3x faster with this patch.

My next series of patches will center around doing tree export in
parallel. This should make things scale up to the number of cores in
your machine. The end of this patch series is a good stopping point
before I make this transition.

In my initial series of patches mailed to this list, I added versioning
of hg-git state. I'll probably re-submit these patches at some point. It
/might/ be a good idea to commit them before this patch series since I
deprecated storing blobs in the mapping file.

The patches in this series should apply cleanly on top of the next
branch.


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:30 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 1 of 9] Optimize get_git_author
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348280926 25200
# Node ID 5ca256907196d908a23dc72bf3230e975565d6e8
# Parent  e152bdf5998098e135d15affd2a98c357d323b3f
Optimize get_git_author

Pre-compile regular expression. Prevent extra key lookup in author_map.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -24,16 +24,18 @@
 from mercurial.node import hex, bin, nullid
 from mercurial import context, util as hgutil
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

+RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -428,22 +430,20 @@
         """
         return re.sub('[<>\n]', '?', name.lstrip('< ').rstrip('> '))

     def get_git_author(self, ctx):
         # hg authors might not have emails
         author = ctx.user()

         # see if a translation exists
-        if author in self.author_map:
-            author = self.author_map[author]
+        author = self.author_map.get(author, author)

         # check for git author pattern compliance
-        regex = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')
-        a = regex.match(author)
+        a = RE_GIT_AUTHOR.match(author)

         if a:
             name = self.get_valid_git_username_email(a.group(1))
             email = self.get_valid_git_username_email(a.group(2))
             if a.group(3) != None and len(a.group(3)) != 0:
                 name += ' ext:(' + urllib.quote(a.group(3)) + ')'
             author = self.get_valid_git_username_email(name) + ' <' + self.get_valid_git_username_email(email) + '>'
         elif '@' in author:
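
The "extra key lookup" removed by this patch can be seen in isolation in the sketch below. The function names and the sample mapping are hypothetical; only the two lookup styles come from the diff. Both return the same result, but the dict.get() form hashes the key once instead of twice when a translation exists.

```python
def translate_two_lookups(author, author_map):
    # Original form: a membership test followed by a second lookup.
    if author in author_map:
        author = author_map[author]
    return author


def translate_one_lookup(author, author_map):
    # Patched form: one lookup, falling back to the unmodified author.
    return author_map.get(author, author)


# Hypothetical mapping, in the shape the authors file produces.
AUTHOR_MAP = {'jdoe': 'John Doe <john@example.com>'}
```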


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:31 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 2 of 9] Precompile Git URI regular expression
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348281136 25200
# Node ID 58a14700666b1eb51d13bee03ce411da149104a7
# Parent  5ca256907196d908a23dc72bf3230e975565d6e8
Precompile Git URI regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -26,16 +26,23 @@
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

+# Test for git:// and git+ssh:// URI.
+# Support several URL forms, including separating the
+# host and path with either a / or : (sepr)
+RE_GIT_URI = re.compile(
+    r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
+    r'(?P<sepr>[:/])(?P<path>.*)$')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -1288,24 +1295,17 @@
         except UnicodeDecodeError:
             return string.decode('ascii', 'replace').encode('utf-8')

     def get_transport_and_path(self, uri):
         # pass hg's ui.ssh config to dulwich
         if not issubclass(client.get_ssh_vendor, _ssh.SSHVendor):
             client.get_ssh_vendor = _ssh.generate_ssh_vendor(self.ui)

-        # Test for git:// and git+ssh:// URI.
-        #  Support several URL forms, including separating the
-        #  host and path with either a / or : (sepr)
-        git_pattern = re.compile(
-            r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
-            r'(?P<sepr>[:/])(?P<path>.*)$'
-        )
-        git_match = git_pattern.match(uri)
+        git_match = RE_GIT_URI.match(uri)
         if git_match:
             res = git_match.groupdict()
             transport = client.SSHGitClient if 'ssh' in res['scheme'] else client.TCPGitClient
             host, port, sepr, path = res['host'], res['port'], res['sepr'], res['path']
             if sepr == '/':
                 path = '/' + path
             # strip trailing slash for heroku-style URLs
             # ssh+git://...@heroku.com:project.git/
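
To make the named groups in RE_GIT_URI easier to follow, here is a small sketch that applies the same pattern to a few URIs. The helper split_git_uri and the example hosts are assumptions for illustration; the trailing-slash handling and transport selection of the real method are left out.

```python
import re

RE_GIT_URI = re.compile(
    r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
    r'(?P<sepr>[:/])(?P<path>.*)$')


def split_git_uri(uri):
    """Return (host, port, path) for a git:// or git+ssh:// URI, else None."""
    m = RE_GIT_URI.match(uri)
    if m is None:
        return None
    res = m.groupdict()
    path = res['path']
    # As in the patched method, a '/' separator is part of the path,
    # while a ':' separator is not.
    if res['sepr'] == '/':
        path = '/' + path
    return res['host'], res['port'], path
```

The port group only matches when the colon is followed by digits, which is what lets "host:project.git" be read as host plus path rather than host plus port.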


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:32 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 3 of 9] Precompile Git username sanitizing regular expression
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348281417 25200
# Node ID 1de6cd07221e0d5fb90b30e515891a325574a9fc
# Parent  58a14700666b1eb51d13bee03ce411da149104a7
Precompile Git username sanitizing regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -26,16 +26,18 @@
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

+RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')
+
 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 class GitProgress(object):
@@ -430,17 +432,17 @@
         'j...@doe.com'
         >>> g(' <j...@doe.com> ')
         'j...@doe.com'
         >>> g('    <random<\n<garbage\n>  > > ')
         'random???garbage?'
         >>> g('Typo in hgrc >but.hg-...@handles.it.gracefully>')
         'Typo in hgrc ?but.hg-...@handles.it.gracefully'
         """
-        return re.sub('[<>\n]', '?', name.lstrip('< ').rstrip('> '))
+        return RE_GIT_SANITIZE_AUTHOR.sub('?', name.lstrip('< ').rstrip('> '))

     def get_git_author(self, ctx):
         # hg authors might not have emails
         author = ctx.user()

         # see if a translation exists
         author = self.author_map.get(author, author)
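
The sanitizing step is small enough to try on its own. The sketch below lifts the patched expression into a free function (the name sanitize is made up here) and uses inputs in the style of the method's doctests.

```python
import re

RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')


def sanitize(name):
    # Strip angle brackets and spaces from both ends first, then replace
    # any remaining '<', '>' or newline with '?'.
    return RE_GIT_SANITIZE_AUTHOR.sub('?', name.lstrip('< ').rstrip('> '))
```

Stripping happens before substitution, so brackets at the edges disappear while brackets in the middle become question marks.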


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:33 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 4 of 9] Precompile Git author extra data regular expression
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348281593 25200
# Node ID fbb9ade686ffdd66ed30a7e9b68d9246409f7389
# Parent  1de6cd07221e0d5fb90b30e515891a325574a9fc
Precompile Git author extra data regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -28,16 +28,18 @@
 import _ssh
 import util
 from overlay import overlayrepo

 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

 RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')

+RE_GIT_AUTHOR_EXTRA = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
+
 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 class GitProgress(object):
@@ -702,18 +704,17 @@
         text = '\n'.join([l.rstrip() for l in text.splitlines()]).strip('\n')
         if text + '\n' != origtext:
             extra['message'] = create_delta(text +'\n', origtext)

         author = commit.author

         # convert extra data back to the end
         if ' ext:' in commit.author:
-            regex = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')
-            m = regex.match(commit.author)
+            m = RE_GIT_AUTHOR_EXTRA.match(commit.author)
             if m:
                 name = m.group(1)
                 ex = urllib.unquote(m.group(2))
                 email = m.group(3)
                 author = name + ' <' + email + '>' + ex

         if ' <none@none>' in commit.author:
             author = commit.author[:-12]
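
The two author expressions work as a pair: text trailing the email is URL-quoted into an " ext:(...)" suffix on export and restored on import. The sketch below shows that round trip in Python 3 (urllib.parse instead of Python 2's urllib). The function names are hypothetical, the redundant backslashes in the patterns are dropped, and the sanitizing and "none@none" fallbacks of the real code are omitted.

```python
import re
from urllib.parse import quote, unquote

RE_GIT_AUTHOR = re.compile(r'^(.*?) ?<(.*?)(?:>(.*))?$')
RE_GIT_AUTHOR_EXTRA = re.compile(r'^(.*?) ext:\((.*)\) <(.*)>$')


def hg_to_git_author(author):
    m = RE_GIT_AUTHOR.match(author)
    if not m:
        return author
    name, email, extra = m.group(1), m.group(2), m.group(3)
    if extra:
        # Anything after the closing '>' is preserved in quoted form.
        name += ' ext:(' + quote(extra) + ')'
    return name + ' <' + email + '>'


def git_to_hg_author(author):
    if ' ext:' in author:
        m = RE_GIT_AUTHOR_EXTRA.match(author)
        if m:
            name = m.group(1)
            ex = unquote(m.group(2))
            email = m.group(3)
            # Convert the extra data back to the end of the string.
            author = name + ' <' + email + '>' + ex
    return author
```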


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:34 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 5 of 9] Precompile Git progress regular expressions
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348281744 25200
# Node ID df4f8b7f800ff2a4c8a0c8e30575f38d5ea45fcb
# Parent  fbb9ade686ffdd66ed30a7e9b68d9246409f7389
Precompile Git progress regular expressions

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -37,39 +37,42 @@

 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

+RE_NEWLINES = re.compile('[\r\n]')
+RE_GIT_PROGRESS = re.compile('\((\d+)/(\d+)\)')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

     def progress(self, msg):
         # 'Counting objects: 33640, done.\n'
         # 'Compressing objects:   0% (1/9955)   \r
-        msgs = re.split('[\r\n]', self.msgbuf + msg)
+        msgs = RE_NEWLINES.split(self.msgbuf + msg)
         self.msgbuf = msgs.pop()

         for msg in msgs:
             td = msg.split(':', 1)
             data = td.pop()
             if not td:
                 self.flush(data)
                 continue
             topic = td[0]

-            m = re.search('\((\d+)/(\d+)\)', data)
+            m = RE_GIT_PROGRESS.search(data)
             if m:
                 if self.lasttopic and self.lasttopic != topic:
                     self.flush()
                 self.lasttopic = topic

                 pos, total = map(int, m.group(1, 2))
                 util.progress(self.ui, topic, pos, total=total)
             else:
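
The buffering in GitProgress.progress() is the part that is easy to miss: the server output arrives in arbitrary chunks, so the last, possibly incomplete, line is held back until the next call. The sketch below reproduces that with the two precompiled patterns. The class ProgressParser and its feed() method are assumptions for the example; it returns parsed tuples instead of calling into Mercurial's progress API.

```python
import re

RE_NEWLINES = re.compile('[\r\n]')
RE_GIT_PROGRESS = re.compile(r'\((\d+)/(\d+)\)')


class ProgressParser(object):
    """Hypothetical stand-in that collects (topic, pos, total) tuples."""

    def __init__(self):
        self.msgbuf = ''

    def feed(self, msg):
        # Split on CR or LF; the final piece may be an incomplete line,
        # so keep it for the next call.
        msgs = RE_NEWLINES.split(self.msgbuf + msg)
        self.msgbuf = msgs.pop()

        events = []
        for line in msgs:
            topic, sep, data = line.partition(':')
            if not sep:
                continue
            m = RE_GIT_PROGRESS.search(data)
            if m:
                pos, total = map(int, m.group(1, 2))
                events.append((topic, pos, total))
        return events
```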


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:35 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 6 of 9] Precompile author file regular expression
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348281830 25200
# Node ID 95b937230a1352d738c81bbe1e3b3a031e071956
# Parent  df4f8b7f800ff2a4c8a0c8e30575f38d5ea45fcb
Precompile author file regular expression

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -40,16 +40,18 @@
 # host and path with either a / or : (sepr)
 RE_GIT_URI = re.compile(
     r'^(?P<scheme>git([+]ssh)?://)(?P<host>.*?)(:(?P<port>\d+))?'
     r'(?P<sepr>[:/])(?P<path>.*)$')

 RE_NEWLINES = re.compile('[\r\n]')
 RE_GIT_PROGRESS = re.compile('\((\d+)/(\d+)\)')

+RE_AUTHOR_FILE = re.compile('\s*=\s*')
+
 class GitProgress(object):
     """convert git server progress strings into mercurial progress"""
     def __init__(self, ui):
         self.ui = ui

         self.lasttopic = None
         self.msgbuf = ''

@@ -120,17 +122,17 @@
         if self.ui.config('git', 'authors'):
             with open(self.repo.wjoin(
                 self.ui.config('git', 'authors')
             )) as f:
                 for line in f:
                     line = line.strip()
                     if not line or line.startswith('#'):
                         continue
-                    from_, to = re.split(r'\s*=\s*', line, 2)
+                    from_, to = RE_AUTHOR_FILE.split(line, 2)
                     self.author_map[from_] = to

     ## FILE LOAD AND SAVE METHODS

     def map_set(self, gitsha, hgsha):
         self._map_git[gitsha] = hgsha
         self._map_hg[hgsha] = gitsha
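
For reference, the authors-file loop can be exercised without a repository. The sketch below takes an iterable of lines instead of opening the configured file; parse_author_lines is a made-up name. One deliberate difference is labeled in the code: it passes maxsplit=1, whereas the diff passes 2. A maxsplit of 2 allows three parts, so a line with two "=" signs would fail to unpack into two names.

```python
import re

RE_AUTHOR_FILE = re.compile(r'\s*=\s*')


def parse_author_lines(lines):
    author_map = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Assumption for this sketch: split on the first '=' only, so the
        # right-hand side may itself contain '='. The patch uses 2 here.
        from_, to = RE_AUTHOR_FILE.split(line, maxsplit=1)
        author_map[from_] = to
    return author_map
```

As in the original loop, a non-comment line without any "=" raises ValueError on unpacking.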


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:36 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 7 of 9] Verify tree and parent objects are in Git repo
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348284386 25200
# Node ID 2db03c124dde9c84de1006526f497d867094a231
# Parent  95b937230a1352d738c81bbe1e3b3a031e071956
Verify tree and parent objects are in Git repo

When exporting Git commits, verify that the tree and parent objects
exist in the repository before allowing the commit to be exported. If a
tree or parent commit is missing, then the repository is not valid and
the export should not be allowed.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -379,24 +379,32 @@
             commit.commit_time = commit.author_time
             commit.commit_timezone = commit.author_timezone

         commit.parents = []
         for parent in self.get_git_parents(ctx):
             hgsha = hex(parent.node())
             git_sha = self.map_git_get(hgsha)
             if git_sha:
+                if git_sha not in self.git.object_store:
+                    raise hgutil.Abort(_('Parent SHA-1 not present in Git '
+                                         'repo: %s' % git_sha))
+
                 commit.parents.append(git_sha)

         commit.message = self.get_git_message(ctx)

         if 'encoding' in extra:
             commit.encoding = extra['encoding']

         tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+        if tree_sha not in self.git.object_store:
+            raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
+                tree_sha))
+
         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())

         self.swap_out_encoding(oldenc)
         return commit.id
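
The check added by this patch amounts to a membership test against the object store before the commit is written. A minimal sketch, assuming only that the store supports the "in" operator (here a plain set stands in for it; MissingObject and check_commit_references are hypothetical names, and the real code raises hgutil.Abort):

```python
class MissingObject(Exception):
    """Hypothetical stand-in for the abort raised by the real code."""


def check_commit_references(object_store, tree_sha, parent_shas):
    # Refuse to write a commit whose parents or tree are not in the store;
    # such a commit would leave the repository in an invalid state.
    for sha in parent_shas:
        if sha not in object_store:
            raise MissingObject(
                'Parent SHA-1 not present in Git repo: %s' % sha)
    if tree_sha not in object_store:
        raise MissingObject(
            'Tree SHA-1 not present in Git repo: %s' % tree_sha)
```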


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:37 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 8 of 9] Make get_valid_git_username_email static
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348284630 25200
# Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
# Parent  2db03c124dde9c84de1006526f497d867094a231
Make get_valid_git_username_email static

Also alias where it is used to make code a little easier to read.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -403,17 +403,18 @@
         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())

         self.swap_out_encoding(oldenc)
         return commit.id

-    def get_valid_git_username_email(self, name):
+    @staticmethod
+    def get_valid_git_username_email(name):
         r"""Sanitize usernames and emails to fit git's restrictions.

         The following is taken from the man page of git's fast-import
         command:

             [...] Likewise LF means one (and only one) linefeed [...]

             committer
@@ -435,17 +436,17 @@
         angle brackets and spaces from the beginning, and right angle
         brackets and spaces from the end, of this string, to convert
         such things as " <j...@doe.com> " to "j...@doe.com" for
         convenience.

         TESTS:

         >>> from mercurial.ui import ui
-        >>> g = GitHandler('', ui()).get_valid_git_username_email
+        >>> g = GitHandler.get_valid_git_username_email
         >>> g('John Doe')
         'John Doe'
         >>> g('j...@doe.com')
         'j...@doe.com'
         >>> g(' <j...@doe.com> ')
         'j...@doe.com'
         >>> g('    <random<\n<garbage\n>  > > ')
         'random???garbage?'
@@ -459,26 +460,28 @@
         author = ctx.user()

         # see if a translation exists
         author = self.author_map.get(author, author)

         # check for git author pattern compliance
         a = RE_GIT_AUTHOR.match(author)

+        get_valid = GitHandler.get_valid_git_username_email
+
         if a:
-            name = self.get_valid_git_username_email(a.group(1))
-            email = self.get_valid_git_username_email(a.group(2))
+            name = get_valid(a.group(1))
+            email = get_valid(a.group(2))
             if a.group(3) != None and len(a.group(3)) != 0:
                 name += ' ext:(' + urllib.quote(a.group(3)) + ')'
-            author = self.get_valid_git_username_email(name) + ' <' + self.get_valid_git_username_email(email) + '>'
+            author = get_valid(name) + ' <' + get_valid(email) + '>'
         elif '@' in author:
-            author = self.get_valid_git_username_email(author) + ' <' + self.get_valid_git_username_email(author) + '>'
+            author = get_valid(author) + ' <' + get_valid(author) + '>'
         else:
-            author = self.get_valid_git_username_email(author) + ' <none@none>'
+            author = get_valid(author) + ' <none@none>'

         if 'author' in ctx.extra():
             author = "".join(apply_delta(author, ctx.extra()['author']))

         return author

     def get_git_parents(self, ctx):
         def is_octopus_part(ctx):
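
What making the method static buys is that it can be called, and aliased, without constructing a handler. The sketch below uses a hypothetical Handler class in place of GitHandler and keeps only the three formatting branches from the diff; the "ext:" handling and the ctx.extra() delta are left out.

```python
import re

RE_GIT_AUTHOR = re.compile(r'^(.*?) ?<(.*?)(?:>(.*))?$')
RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')


class Handler(object):
    """Hypothetical stand-in for GitHandler."""

    @staticmethod
    def get_valid_git_username_email(name):
        return RE_GIT_SANITIZE_AUTHOR.sub(
            '?', name.lstrip('< ').rstrip('> '))

    def get_git_author(self, author):
        # Short local alias, as introduced by the patch.
        get_valid = Handler.get_valid_git_username_email

        a = RE_GIT_AUTHOR.match(author)
        if a:
            author = get_valid(a.group(1)) + ' <' + get_valid(a.group(2)) + '>'
        elif '@' in author:
            author = get_valid(author) + ' <' + get_valid(author) + '>'
        else:
            author = get_valid(author) + ' <none@none>'
        return author
```

This is also why the doctest in the patch no longer needs to build a GitHandler with a ui object.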


 
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 23 Sep 2012 17:03:38 -0700
Local: Sun, Sep 23 2012 8:03 pm
Subject: [PATCH 9 of 9] Implement TreeTracker for incremental tree calculation
# HG changeset patch
# User Gregory Szorc <gregory.sz...@gmail.com>
# Date 1348422117 25200
# Node ID 85c4b8e2e129975f400c9810eb9bf6ce6fea4c8b
# Parent  ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
Implement TreeTracker for incremental tree calculation

This class makes exporting Mercurial changesets to Git much faster.

diff --git a/hggit/git_handler.py b/hggit/git_handler.py
--- a/hggit/git_handler.py
+++ b/hggit/git_handler.py
@@ -1,13 +1,12 @@
 import os, math, urllib, re
 import stat, posixpath, StringIO

 from dulwich.errors import HangupException, GitProtocolError, UpdateRefsError
-from dulwich.index import commit_tree
 from dulwich.objects import Blob, Commit, Tag, Tree, parse_timezone, S_IFGITLINK
 from dulwich.pack import create_delta, apply_delta
 from dulwich.repo import Repo
 from dulwich import client
 from dulwich import config as dul_config

 try:
     from mercurial import bookmarks
@@ -24,16 +23,18 @@
 from mercurial.node import hex, bin, nullid
 from mercurial import context, util as hgutil
 from mercurial import error

 import _ssh
 import util
 from overlay import overlayrepo

+from .hg2git import TreeTracker
+
 RE_GIT_AUTHOR = re.compile('^(.*?) ?\<(.*?)(?:\>(.*))?$')

 RE_GIT_SANITIZE_AUTHOR = re.compile('[<>\n]')

 RE_GIT_AUTHOR_EXTRA = re.compile('^(.*?)\ ext:\((.*)\) <(.*)\>$')

 # Test for git:// and git+ssh:// URI.
 # Support several URL forms, including separating the
@@ -323,32 +324,35 @@
     def export_git_objects(self):
         self.init_if_missing()

         nodes = [self.repo.lookup(n) for n in self.repo]
         export = [node for node in nodes if not hex(node) in self._map_hg]
         total = len(export)
         if total:
             self.ui.status(_("exporting hg objects to git\n"))
+
+        tracker = TreeTracker(self.repo)
+
         for i, rev in enumerate(export):
             util.progress(self.ui, 'exporting', i, total=total)
             ctx = self.repo.changectx(rev)
             state = ctx.extra().get('hg-git', None)
             if state == 'octopus':
                 self.ui.debug("revision %d is a part "
                               "of octopus explosion\n" % ctx.rev())
                 continue
-            self.export_hg_commit(rev)
+            self.export_hg_commit(rev, tracker)
         util.progress(self.ui, 'importing', None, total=total)

     # convert this commit into git objects
     # go through the manifest, convert all blobs/trees we don't have
     # write the commit object (with metadata info)
-    def export_hg_commit(self, rev):
+    def export_hg_commit(self, rev, tracker):
         self.ui.note(_("converting revision %s\n") % hex(rev))

         oldenc = self.swap_out_encoding()

         ctx = self.repo.changectx(rev)
         extra = ctx.extra()

         commit = Commit()
@@ -390,17 +394,21 @@

                 commit.parents.append(git_sha)

         commit.message = self.get_git_message(ctx)

         if 'encoding' in extra:
             commit.encoding = extra['encoding']

-        tree_sha = commit_tree(self.git.object_store, self.iterblobs(ctx))
+        for obj in tracker.update_changeset(ctx):
+            self.git.object_store.add_object(obj)
+
+        tree_sha = tracker.root_tree_sha
+
         if tree_sha not in self.git.object_store:
             raise hgutil.Abort(_('Tree SHA-1 not present in Git repo: %s' %
                 tree_sha))

         commit.tree = tree_sha

         self.git.object_store.add_object(commit)
         self.map_set(commit.id, ctx.hex())
@@ -536,53 +544,16 @@
                 add_extras = True
                 extra_message += "extra : " + key + " : " +  urllib.quote(value) + "\n"

         if add_extras:
             message += "\n--HG--\n" + extra_message

         return message

-    def iterblobs(self, ctx):
-        if '.hgsubstate' in ctx:
-            hgsub = util.OrderedDict()
-            if '.hgsub' in ctx:
-                hgsub = util.parse_hgsub(ctx['.hgsub'].data().splitlines())
-            hgsubstate = util.parse_hgsubstate(ctx['.hgsubstate'].data().splitlines())
-            for path, sha in hgsubstate.iteritems():
-                try:
-                    if path in hgsub and not hgsub[path].startswith('[git]'):
-                        # some other kind of a repository (e.g. [hg])
-                        # that keeps its state in .hgsubstate, shall ignore
-                        continue
-                    yield path, sha, S_IFGITLINK
-                except ValueError:
-                    pass
-
-        for f in ctx:
-            if f == '.hgsubstate' or f == '.hgsub':
-                continue
-            fctx = ctx[f]
-            blobid = self.map_git_get(hex(fctx.filenode()))
-
-            if not blobid:
-                blob = Blob.from_string(fctx.data())
-                self.git.object_store.add_object(blob)
-                self.map_set(blob.id, hex(fctx.filenode()))
-                blobid = blob.id
-
-            if 'l' in ctx.flags(f):
-                mode = 0120000
-            elif 'x' in ctx.flags(f):
-                mode = 0100755
-            else:
-                mode = 0100644
-
-            yield f, blobid, mode
-
     def getnewgitcommits(self, refs=None):
         self.init_if_missing()

         # import heads and fetched tags as remote references
         todo = []
         done = set()
         convert_list = {}

diff --git a/hggit/hg2git.py b/hggit/hg2git.py
new file mode 100644
--- /dev/null
+++ b/hggit/hg2git.py
@@ -0,0 +1,205 @@
+# This file contains code dealing specifically with converting Mercurial
+# repositories to Git repositories. Code in this file is meant to be a generic
+# library and should be usable outside the context of hg-git or an hg command.
+
+import os
+import stat
+
+from dulwich.objects import Blob
+from dulwich.objects import S_IFGITLINK
+from dulwich.objects import TreeEntry
+from dulwich.objects import Tree
+
+from mercurial import error as hgerror
+from mercurial.node import nullrev
+
+from . import util
+
+class TreeTracker(object):
+    """Tracks Git tree objects across Mercurial revisions.
+
+    The purpose of this class is to facilitate Git tree export that is more
+    optimal than brute force. The tree calculation part of this class is
+    essentially a reimplementation of dulwich.index.commit_tree. However, since
+    our implementation reuses Tree instances and only recalculates SHA-1 when
+    things change, we are much more efficient.
+
+    Callers instantiate this class against a mercurial.localrepo instance. They
+    then associate the tracker with a specific changeset by calling
+    update_changeset(). That function emits Git objects that need to be
+    exported to a Git repository. Callers then typically obtain the
+    root_tree_sha and use that as part of assembling a Git commit.
+    """
+
+    def __init__(self, hg_repo):
+        self._hg = hg_repo
+        self._rev = nullrev
+        self._dirs = {}
+        self._blob_cache = {}
+
+    @property
+    def root_tree_sha(self):
+        return self._dirs[''].id
+
+    def update_changeset(self, ctx):
+        """Set the tree to track a new Mercurial changeset.
+
+        This is a generator of dulwich Git objects. Each returned object can be
+        added to a Git store via add_object(). Some objects may already exist
+        in the Git repository. Emitted objects are either Blob or Tree
+        instances.
+
+        Emitted objects are those that have changed since the last call to
+        update_changeset.
+        """
+        # In theory we should be able to look at changectx.files(). This is
+        # *much* faster. However, it may not be accurate, especially with older
+        # repositories, which may not record things like deleted files
+        # explicitly in the manifest (which is where files() gets its data).
+        # The only reliable way to get the full set of changes is by looking at
+        # the full manifest. And, the easy way to compare two manifests is
+        # localrepo.status().
+
+        # The other members of status are only relevant when looking at the
+        # working directory.
+        modified, added, removed = self._hg.status(self._rev, ctx.rev())[0:3]
+
+        for path in sorted(removed, key=len, reverse=True):
+            d = os.path.dirname(path)
+            tree = self._dirs.get(d, Tree())
+
+            del tree[os.path.basename(path)]
+
+            if not len(tree):
+                self._remove_tree(d)
+                continue
+
+            self._dirs[d] = tree
+
+        for path in sorted(set(modified) | set(added), key=len, reverse=True):
+            if path == '.hgsubstate':
+                self._handle_subrepos(ctx)
+                continue
+
+            if path == '.hgsub':
+                continue
+
+            d = os.path.dirname(path)
+            tree = self._dirs.get(d, Tree())
+
+            fctx = ctx[path]
+
+            entry, blob = TreeTracker.tree_entry(fctx, self._blob_cache)
+            if blob is not None:
+                yield blob
+
+            tree.add(*entry)
+            self._dirs[d] = tree
+
+        for obj in self._populate_tree_entries():
+            yield obj
+
+        self._rev = ctx.rev()
+
+    def _remove_tree(self, path):
+        try:
+            del self._dirs[path]
+        except KeyError:
+            return
+
+        # Now we traverse up to the parent and delete any references.
+        if path == '':
+            return
+
+        basename = os.path.basename(path)
+        parent = os.path.dirname(path)
+        while True:
+            tree = self._dirs.get(parent, None)
+
+            # No parent entry. Nothing to
...
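
Editorial sketch of the idea in the docstring above (reuse per-directory state and recompute a hash only for directories that changed). This is not the patch's code and does not use dulwich or Mercurial. The class name IncrementalTrees, its methods, and the toy hash format are all assumptions made for illustration, and the hashes are not real Git tree IDs.

```python
import hashlib
import posixpath


class IncrementalTrees:
    """Toy tracker: caches one hash per directory, invalidates on change."""

    def __init__(self):
        self._files = {"": {}}        # directory -> {file name: blob hash}
        self._subdirs = {"": set()}   # directory -> names of child directories
        self._sha = {}                # directory -> cached tree hash
        self.recomputed = []          # directories rehashed, in order

    def _ensure_dir(self, d):
        # Create d and any missing ancestors, linking each into its parent.
        if d in self._files:
            return
        parent, name = posixpath.split(d)
        self._ensure_dir(parent)
        self._files[d] = {}
        self._subdirs[d] = set()
        self._subdirs[parent].add(name)

    def _invalidate(self, d):
        # Drop the cached hash of d and of every ancestor up to the root.
        while True:
            self._sha.pop(d, None)
            if d == "":
                break
            d = posixpath.dirname(d)

    def set_file(self, path, content):
        d, name = posixpath.split(path)
        self._ensure_dir(d)
        self._files[d][name] = hashlib.sha1(content).hexdigest()
        self._invalidate(d)

    def tree_sha(self, d=""):
        # Return the cached hash when the directory is unchanged.
        if d in self._sha:
            return self._sha[d]
        h = hashlib.sha1()
        for name in sorted(self._files[d]):
            h.update(f"blob {name} {self._files[d][name]}\n".encode())
        for name in sorted(self._subdirs[d]):
            child = posixpath.join(d, name)
            h.update(f"tree {name} {self.tree_sha(child)}\n".encode())
        self._sha[d] = h.hexdigest()
        self.recomputed.append(d)
        return self._sha[d]
```

In this sketch, changing one file under src/ rehashes only src and the root. An untouched sibling such as docs/ keeps its cached value, which is the saving the docstring describes.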

Discussion subject changed to "Performance enhancements for exporting to Git" by David M. Carr
David M. Carr  
From: "David M. Carr" <davidm...@gmail.com>
Date: Mon, 24 Sep 2012 09:02:11 -0400
Local: Mon, Sep 24 2012 9:02 am
Subject: Re: [PATCH 0 of 9] Performance enhancements for exporting to Git

Thanks for submitting these patches.  Better performance exporting to
Git is definitely something I've been wishing for.  I'll try to take a
closer look at them tonight.

--
David M. Carr
davidm...@gmail.com


 
Discussion subject changed to "Make get_valid_git_username_email static" by Augie Fackler
Augie Fackler  
From: Augie Fackler <r...@durin42.com>
Date: Mon, 24 Sep 2012 08:56:42 -0500
Local: Mon, Sep 24 2012 9:56 am
Subject: Re: [PATCH 8 of 9] Make get_valid_git_username_email static

On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.sz...@gmail.com>
> # Date 1348284630 25200
> # Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
> # Parent  2db03c124dde9c84de1006526f497d867094a231
> Make get_valid_git_username_email static

Rather than making this a static method, can we make it a module-level free function? Is there a reason not to?


 
Discussion subject changed to "Performance enhancements for exporting to Git" by Augie Fackler
Augie Fackler  
From: Augie Fackler <r...@durin42.com>
Date: Mon, 24 Sep 2012 08:58:18 -0500
Local: Mon, Sep 24 2012 9:58 am
Subject: Re: [PATCH 0 of 9] Performance enhancements for exporting to Git
queued 1-7, dropped last two.

In the future, I'd appreciate commit messages along the lines of hg's style:

component: something succinct

Free form text here.

eg

git_handler: optimize get_git_author

Thanks!

On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:


 
Gregory Szorc  
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Mon, 24 Sep 2012 09:33:25 -0700
Local: Mon, Sep 24 2012 12:33 pm
Subject: Re: [PATCH 0 of 9] Performance enhancements for exporting to Git
On 9/24/2012 6:58 AM, Augie Fackler wrote:

> queued 1-7, dropped last two.

> In the future, I'd appreciate commit messages along the lines of hg's style:

> component: something succinct

> Free form text here.

> eg

> git_handler: optimize get_git_author

Will do (this is my first set of patches to hg-git or Mercurial).
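
Editorial sketch: a rough check for the "component: something succinct" summary line requested above. The regular expression and the function name follows_component_style are assumptions for illustration only. Mercurial and hg-git do not ship this check, and the characters allowed in a component name are a guess.

```python
import re

# Assumed shape: a component name, a colon, one space, then a summary.
SUMMARY_RE = re.compile(r"^[A-Za-z0-9_./-]+: \S.*$")


def follows_component_style(message):
    """Return True if the first line looks like 'component: summary'."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return bool(SUMMARY_RE.match(first_line))
```

With this pattern, "git_handler: optimize get_git_author" passes, while "Make get_valid_git_username_email static" does not.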

 
Discussion subject changed to "Make get_valid_git_username_email static" by Gregory Szorc
Gregory Szorc  
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Mon, 24 Sep 2012 09:34:58 -0700
Local: Mon, Sep 24 2012 12:34 pm
Subject: Re: [PATCH 8 of 9] Make get_valid_git_username_email static
On 9/24/2012 6:56 AM, Augie Fackler wrote:

> On Sep 23, 2012, at 7:03 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:

>> # HG changeset patch
>> # User Gregory Szorc <gregory.sz...@gmail.com>
>> # Date 1348284630 25200
>> # Node ID ef583ac939de39b80aaff2d1d3d9f47bf1a1c9f3
>> # Parent  2db03c124dde9c84de1006526f497d867094a231
>> Make get_valid_git_username_email static
> Rather than making this a static method, can we make it a module-level free function? Is there a reason not to?

If that's the style you prefer, I see no reason why it can't be a
module-level function.
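
Editorial sketch of the two styles being discussed. The helper body here is a made-up stand-in, not the logic of get_valid_git_username_email, and the names normalize_identity and HandlerWithStatic are hypothetical.

```python
def normalize_identity(raw):
    """Option B: module-level free function (hypothetical helper).

    Drops stray angle brackets and collapses runs of whitespace.
    """
    cleaned = raw.replace("<", " ").replace(">", " ")
    return " ".join(cleaned.split())


class HandlerWithStatic:
    """Option A: the same helper attached to a class as a static method."""

    @staticmethod
    def normalize_identity(raw):
        cleaned = raw.replace("<", " ").replace(">", " ")
        return " ".join(cleaned.split())
```

Both forms behave identically because neither touches instance state. The free function can be imported and tested without referring to the class at all, which is one common argument for that style.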

 
Discussion subject changed to "Implement TreeTracker for incremental tree calculation" by Gregory Szorc
Gregory Szorc  
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Mon, 24 Sep 2012 09:39:14 -0700 (PDT)
Local: Mon, Sep 24 2012 12:39 pm
Subject: Re: [PATCH 9 of 9] Implement TreeTracker for incremental tree calculation

Having a night to sleep on it, I don't like dropping files/blobs from the
map file. Not yet, anyway. Let me refactor this a little bit to add them
back in. This should be a pretty minor and non-invasive change. So, I don't
think you'll waste time looking at the other code before I get around to
creating a new version of the patch.

...

Discussion subject changed to "Performance enhancements for exporting to Git" by David M. Carr
David M. Carr  
From: "David M. Carr" <davidm...@gmail.com>
Date: Wed, 26 Sep 2012 03:30:19 -0400
Local: Wed, Sep 26 2012 3:30 am
Subject: Re: [PATCH 0 of 9] Performance enhancements for exporting to Git
On Mon, Sep 24, 2012 at 9:02 AM, David M. Carr <davidm...@gmail.com> wrote:

I've now had a chance to look at these further.  I can confirm that
with the patches applied, all-version-tests still passes for me.
Likewise, I can confirm that patch 9 appears to provide a nice
performance boost.  I attempted to measure the performance impact of
patches 1-8, but wasn't able to measure any significant differences.
From a code review perspective, I don't have anything to add to what
Augie has already said.

--
David M. Carr
davidm...@gmail.com


 
Discussion subject changed to "Implement TreeTracker for incremental tree calculation" by David M. Carr
David M. Carr  
From: "David M. Carr" <da...@carrclan.us>
Date: Fri, 2 Nov 2012 16:39:05 -0400
Local: Fri, Nov 2 2012 4:39 pm
Subject: Re: [PATCH 9 of 9] Implement TreeTracker for incremental tree calculation

...

Max Horn  
From: Max Horn <post...@quendi.de>
Date: Tue, 13 Nov 2012 11:49:03 +0100
Local: Tues, Nov 13 2012 5:49 am
Subject: Re: [PATCH 9 of 9] Implement TreeTracker for incremental tree calculation

On 02.11.2012, at 21:39, David M. Carr wrote:

> On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
>> Having a night to sleep on it, I don't like dropping files/blobs from the
>> map file. Not yet, anyway. Let me refactor this a little bit to add them
>> back in. This should be a pretty minor and non-invasive change. So, I don't
>> think you'll waste time looking at the other code before I get around to
>> creating a new version of the patch.

[...]

> Any luck at creating a new version of this patch?

Yeah, I would love to know about that, too. A set of patches making hg-git faster is of course something many people would love to see :). It seems Gregory did some work on his patches for a few days after the above email (looking at <https://github.com/indygreg/hg-git/commits/performance-master>), and I would just love to know what the state is, and if one can help... Anyway, I hope Gregory will find some time to work on cleaning this up and re-submitting :).

Cheers,
Max


 
Gregory Szorc  
From: Gregory Szorc <gregory.sz...@gmail.com>
Date: Sun, 18 Nov 2012 15:13:48 -0800
Local: Sun, Nov 18 2012 6:13 pm
Subject: Re: [PATCH 9 of 9] Implement TreeTracker for incremental tree calculation
On 11/2/2012 1:39 PM, David M. Carr wrote:

> On Mon, Sep 24, 2012 at 12:39 PM, Gregory Szorc <gregory.sz...@gmail.com> wrote:
>> Having a night to sleep on it, I don't like dropping files/blobs from the
>> map file. Not yet, anyway. Let me refactor this a little bit to add them
>> back in. This should be a pretty minor and non-invasive change. So, I don't
>> think you'll waste time looking at the other code before I get around to
>> creating a new version of the patch.

> Any luck at creating a new version of this patch?

I believe so, yes. I need to find some time to clean it up and resubmit
it for consideration. I've been extremely busy with other projects. The
latest version of the code is living at
https://github.com/indygreg/hg-git/tree/performance-next for anyone who
is interested. It probably needs to be rebased against the latest tree.

 