Patch: Regular expression support for convert extension's 'filemap' option.

Martin Blais

Jul 6, 2010, 8:14:57 PM
to merc...@selenic.com
Hi,

Here is a patch that adds support for regular expressions in
the convert command's 'filemap' option.

I was attempting to convert a two-year-old Subversion
repository which unfortunately has had unnecessary large
files committed to it in the past (e.g. Java jar files). I
needed to exclude these files by name, but the filenames
varied over the history of the SVN repo; what I needed was to
be able to say "exclude all the Java jar files and the .so
files" and so on. The 'filemap' option allows one to exclude
files by name, but not by pattern.

This patch solves that problem by adding a new directive,
'exclude_re', to the filemap format, which interprets its
argument as a regular expression::

  exclude_re .*\\.(jar|zip|gz|tgz|bz2|tar|as|cs|exe|so|a|dll|swf|swc|swz|png|xsd|jpg|jpeg|gif|ttf|mp3|fla|pdb|pdf|pem|vcproj|html|chm|ppt|sxi|log)$
  exclude_re .*\\.so\\..*$
  exclude bin
  exclude ThirdParty
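
The doubled backslashes in the sample above deserve a note. Here is a minimal sketch, assuming the filemap is tokenized by a `shlex` lexer in POSIX mode. The patch calls `lex.get_token()`, but the construction of `lex` lies outside the hunks shown, so the setup below is an assumption, and the `tokenize` helper and its word-character list are illustrative only. Under that assumption `\\.` reaches `re.compile` as `\.`:

```python
import io
import re
import shlex


def tokenize(text):
    """Split filemap-style text into tokens with a POSIX-mode shlex lexer."""
    lex = shlex.shlex(io.StringIO(text), "filemap", True)
    # Assumption: regex punctuation must count as word characters so that a
    # pattern stays in one token. Backslash is left out on purpose, so it
    # keeps its POSIX role as the escape character.
    lex.wordchars += "!@$%^&*()-=+[]{}|;:,./<>?"
    tokens = []
    token = lex.get_token()
    while token:
        tokens.append(token)
        token = lex.get_token()
    return tokens


# The raw string holds the line exactly as it would appear in the filemap.
tokens = tokenize(r"exclude_re .*\\.so\\..*$" + "\n")
pattern = re.compile(tokens[1])

print(tokens)
print(bool(pattern.match("lib/libssl.so.0.9.8")))
```

If the lexer were not in POSIX mode, a single backslash would probably be the right spelling, so it is worth checking against the actual parser before copying the sample.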

I frankly don't see how I could have converted the team from
the Subversion repository without this patch; without
excluding the large files from the history, the converted
Mercurial repository is upwards of 6GB, which is too large. By
excluding the large files I brought it down to 350MB (which
is a very reasonable size given the project).

(The patch is trivial; I'm extremely busy and this was going
to get lost in the day-to-day grind, but I figured someone
might find it attractive enough and merge it in, so here it is
on the mailing-list. I normally don't monitor the list.)

I love Mercurial! Keep up the amazing work.
cheers,


(Why we need this patch)

Because it makes importing old repositories with large
files possible. This expands the usable domain of
Mercurial. I could not have converted this repository
without it.


(How you've implemented it)

I've modified a single file, 'hgext/convert/filemap.py', to
add a new recognized directive: 'exclude_re'.


(What file formats and data structures you've used)

Similar to what was there.


(What choices you've made)

Implemented a regexp variant of the lookup function in that
module.
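
As a rough illustration of that regexp variant, here is a self-contained sketch. The patch relies on the module's `rpairs` helper, which is not shown in the diff; `prefix_pairs` below is a hypothetical stand-in that yields the full path first and then each shorter directory prefix. The sketch is meant to show two properties: `re.match` anchors only at the start of the candidate, and because every prefix is tried, a pattern that matches a directory name excludes everything beneath it.

```python
import re


def prefix_pairs(name):
    """Yield (prefix, suffix) splits of a '/'-separated path, longest first.

    Hypothetical stand-in for the rpairs helper used by the patch.
    """
    end = len(name)
    while end != -1:
        yield name[:end], name[end + 1:]
        end = name.rfind("/", 0, end)


def lookup_re(name, patterns):
    """Return (prefix, prefix, suffix) for the longest prefix a pattern matches."""
    for pre, suf in prefix_pairs(name):
        # re.match anchors at the start of the prefix, not at its end.
        if any(p.match(pre) for p in patterns):
            return pre, pre, suf
    return "", name, ""


patterns = [re.compile(r".*\.(jar|zip)$")]

print(lookup_re("lib/ext/foo.jar", patterns))
print(lookup_re("dist/app.zip/readme.txt", patterns))
print(lookup_re("src/main.py", patterns))
```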


(Why the choices you've made are the right ones)

I kept it as simple as possible.


(Why the choices you didn't make are the wrong ones)

N/A


(What shortcomings exist)

I did not implement the corresponding 'include_re'; it
would make sense to do so.


(What compatibility issues exist)

I've only added a new exclude pattern to the filemap and did
not remove any. The file format should still support all the
previous 'filemap commands' and remain backwards compatible,
so this should have no impact on compatibility.


(What's missing, if anything)

An option to exclude files by size would also be useful
(i.e., exclude if size is larger than X), but it was
non-trivial to implement, and this did the job wonderfully.


(Testing)

I ran the test suite against hg-stable using 'make tests'.
All tests pass. (I did not add a new test, however.)

tangerine:~/src/hg-stable$ make tests
cd tests && python run-tests.py
................................s......................s............s.......sss.........s.....................................................................................................................................................................................s.................................................................................................................
Skipped test-casefolding: missing feature: case insensitive file system
Skipped test-convert-baz: missing feature: GNU Arch baz client
Skipped test-convert-darcs: missing feature: darcs client
Skipped test-convert-mtn: missing feature: monotone client (> 0.31)
Skipped test-convert-p4: missing feature: Perforce server and client
Skipped test-convert-p4-filetypes: missing feature: Perforce server and client
Skipped test-convert-tla: missing feature: GNU Arch tla client
Skipped test-no-symlinks: system supports symbolic links
# Ran 384 tests, 8 skipped, 0 failed.
tangerine:~/src/hg-stable$

Below is the patch:
--------------------------------------------------------------------------------


util02:~/src/hg-stable$ /usr/bin/hg export -r 10791
# HG changeset patch
# User Martin Blais <bl...@furius.ca>
# Date 1275675257 14400
# Node ID 594f38d73da1829badca578b0881f1cf64e564c5
# Parent efd3b71fc29315e79a29033fdd0d149b309eb398
Added support for regular expressions.

diff -r efd3b71fc293 -r 594f38d73da1 hgext/convert/filemap.py
--- a/hgext/convert/filemap.py Thu Mar 04 13:10:48 2010 +0100
+++ b/hgext/convert/filemap.py Fri Jun 04 14:14:17 2010 -0400
@@ -4,7 +4,7 @@
 # This software may be used and distributed according to the terms of the
 # GNU General Public License version 2 or any later version.

-import shlex
+import shlex, re
 from mercurial.i18n import _
 from mercurial import util
 from common import SKIPREV, converter_source
@@ -25,6 +25,7 @@
         self.ui = ui
         self.include = {}
         self.exclude = {}
+        self.exclude_re = []
         self.rename = {}
         if path:
             if self.parse(path):
@@ -51,6 +52,9 @@
                 errs += check(name, self.include, 'include')
                 errs += check(name, self.rename, 'rename')
                 self.exclude[name] = name
+            elif cmd == 'exclude_re':
+                regexp = lex.get_token()
+                self.exclude_re.append(re.compile(regexp))
             elif cmd == 'rename':
                 src = lex.get_token()
                 dest = lex.get_token()
@@ -73,15 +77,24 @@
                 pass
         return '', name, ''

+    def lookup_re(self, name, remapping):
+        for pre, suf in rpairs(name):
+            if any(r.match(pre) for r in remapping):
+                return pre, pre, suf
+        return '', name, ''
+
     def __call__(self, name):
         if self.include:
             inc = self.lookup(name, self.include)[0]
         else:
             inc = name
+
+        exc = ''
         if self.exclude:
             exc = self.lookup(name, self.exclude)[0]
-        else:
-            exc = ''
+        if self.exclude_re:
+            exc = self.lookup_re(name, self.exclude_re)[0]
+
         if (not self.include and exc) or (len(inc) <= len(exc)):
             return None
         newpre, pre, suf = self.lookup(name, self.rename)
@@ -94,7 +107,7 @@
         return name

     def active(self):
-        return bool(self.include or self.exclude or self.rename)
+        return bool(self.include or self.exclude or self.exclude_re or self.rename)

 # This class does two additional things compared to a regular source:
 #
util02:~/src/hg-stable$
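
One detail of the `__call__` hunk can be looked at in isolation. In the diff, `exc` is assigned from the plain `exclude` lookup and then assigned again from `lookup_re` whenever `exclude_re` is non-empty, so an empty regexp result appears to replace a plain-name match. The sketch below models only that step. `exc_as_posted` and `exc_longest` are hypothetical names, and keeping the longer of the two matches is just one possible alternative, not something the patch does.

```python
def exc_as_posted(plain, regex, has_plain, has_regex):
    """Mirror the order of assignments in the posted __call__ hunk.

    plain and regex stand for the first element returned by lookup() and
    lookup_re(); has_plain and has_regex say whether each list is non-empty.
    """
    exc = ""
    if has_plain:
        exc = plain
    if has_regex:
        exc = regex  # overwrites the plain result, even when regex is ""
    return exc


def exc_longest(plain, regex):
    """Hypothetical alternative: keep whichever match covers the longer prefix."""
    return max(plain, regex, key=len)


# 'bin/tool.sh' with 'exclude bin' plus a regexp list that does not match it.
as_posted = exc_as_posted("bin", "", True, True)
longest = exc_longest("bin", "")

print(repr(as_posted), repr(longest))
```

With `exc` empty and no include list, the final test in `__call__` does not return None, so in this scenario `bin/tool.sh` would seem to be kept rather than excluded. A test mixing `exclude` and `exclude_re` in one filemap would settle it.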

_______________________________________________
Mercurial mailing list
Merc...@selenic.com
http://selenic.com/mailman/listinfo/mercurial

Nicolas Dumazet

Jul 7, 2010, 2:00:39 AM
to Martin Blais, merc...@selenic.com
Hello

2010/7/7 Martin Blais <bl...@furius.ca>:


> Hi,
>
> Here is a patch that adds support for regular expressions in
> the convert command's 'filemap' option.

Nice idea.

Maybe it would be nicer to support relre / relglob patterns, in an
hgignore-fashion?

> I frankly don't see how I could have converted the team from
> the Subversion repository without this patch; without
> excluding the large files from the history, the converted
> Mercurial repository is upwards of 6GB, which is too large. By
> excluding the large files I brought it down to 350MB (which
> is a very reasonable size given the project).

Quite a difference! :)

>
> (The patch is trivial; I'm extremely busy and this was going
> to get lost in the day-to-day grind, but I figured someone
> might find it attractive enough and merge it in, so here it is
> on the mailing-list. I normally don't monitor the list.)

Thanks for sharing it.
Indeed, it will need some work to comply with Mercurial's coding style
(no underscores, use util.any instead of __builtin__.any, etc.), but
it's a useful starting point.

>
> I love Mercurial! Keep up the amazing work.
> cheers,
>
> I ran the test suite against hg-stable using 'make tests'.
> All tests pass. (I did not add a new test, however.)

then someone will need to add one :)

Regards,

--
Nicolas Dumazet — NicDumZ

Martin Geisler

Jul 7, 2010, 3:58:02 AM
to Nicolas Dumazet, Martin Blais, merc...@selenic.com
Nicolas Dumazet <nic...@gmail.com> writes:

> Hello
>
> 2010/7/7 Martin Blais <bl...@furius.ca>:
>> Hi,
>>
>> Here is a patch that adds support for regular expressions in
>> the convert command's 'filemap' option.
>
> Nice idea.
>
> Maybe it would be nicer to support relre / relglob patterns, in an
> hgignore-fashion?

Please also see this recent thread about adding support for glob
patterns to the convert extension:

http://markmail.org/message/mwjc37unpbvzmqka

We should definitely have this functionality! We have some
infrastructure in place already in the match module for dealing with
regexp and glob patterns, so we can preferably reuse that. Something like


exclude foo/bar # normal path
include re:.*\.jpg$ # regexp
include glob:*.png # glob

See 'hg help patterns' for some syntax examples.

I'm actually wondering if we could just make the path a glob pattern by
default? That would of course require people to escape * and ?. However,
it seems more natural to me to write

exclude *.class

than

exclude glob:*.class

(or is it glob:**.class?)
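
To make the prefix idea concrete, here is a small sketch of dispatching on a `re:` or `glob:` prefix. It does not use Mercurial's match module. The standard library's `re` and `fnmatch` serve as stand-ins, and `compile_pattern` is a hypothetical helper. Note that `fnmatch` lets `*` cross `/`, which may well differ from the glob rules described in 'hg help patterns', so the `*` versus `**` question above is not answered by this sketch.

```python
import fnmatch
import re


def compile_pattern(spec):
    """Turn 're:...', 'glob:...' or a plain path into a predicate on file names."""
    if spec.startswith("re:"):
        regex = re.compile(spec[len("re:"):])
        return lambda name: regex.match(name) is not None
    if spec.startswith("glob:"):
        # fnmatch.translate builds a regex for a shell-style glob; this is a
        # stdlib stand-in, not Mercurial's own glob handling.
        regex = re.compile(fnmatch.translate(spec[len("glob:"):]))
        return lambda name: regex.match(name) is not None
    # Plain path: matches the path itself or anything beneath it.
    return lambda name: name == spec or name.startswith(spec + "/")


is_jpg = compile_pattern(r"re:.*\.jpg$")
is_png = compile_pattern("glob:*.png")
in_foo_bar = compile_pattern("foo/bar")

print(is_jpg("img/a.jpg"), is_png("a.png"), in_foo_bar("foo/bar/baz.c"))
```

Keeping the unprefixed form as a plain path, as here, leaves existing filemaps unchanged. Making glob the default would instead change the meaning of any existing entry containing `*` or `?`.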

--
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/

Mads Kiilerich

Jul 8, 2010, 9:55:26 AM
to Martin Geisler, Martin Blais, merc...@selenic.com
I just posted a patch that tries to make the documentation in this
area clearer. Feel free to improve it, so we know what we have before
we even consider changing it.

And:

Martin Geisler wrote, On 07/07/2010 09:58 AM:
> Please also see this recent thread about adding support for glob
> patterns to the convert extension:
>
> http://markmail.org/message/mwjc37unpbvzmqka
>
> We should definitely have this functionality! We have some
> infrastructure in place already in the match module for dealing with
> regexp and glob patterns, so we can preferably reuse that. Something like
>
>
> exclude foo/bar # normal path
> include re:.*\.jpg$ # regexp
> include glob:*.png # glob
>
> See 'hg help patterns' for some syntax examples.

That would make it harder to detect ambiguous include/excludes and to
control the precedence. How would you do that while staying
("sufficiently") backward compatible?

While we are discussing this area: If we support regexp in
include/exclude then it is a reasonable expectation that we also do it
in rename, with proper use of references to matching groups...
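
For illustration, a regexp rename with group references could behave like the sketch below. No such directive exists in the patch. The `rename_re` name and its (pattern, replacement) form are assumptions, built on the standard `re.sub` with backreferences such as `\1`.

```python
import re


def rename_re(name, pattern, replacement):
    """Apply a regexp rename; the name comes back unchanged if nothing matches."""
    return re.sub(pattern, replacement, name, count=1)


# Hypothetical filemap line: rename_re ^trunk/src/(.*)$ lib/\1
moved = rename_re("trunk/src/a.py", r"^trunk/src/(.*)$", r"lib/\1")
untouched = rename_re("docs/readme", r"^trunk/src/(.*)$", r"lib/\1")

print(moved, untouched)
```

The open question about precedence remains. With several regexp renames, something would have to decide which one wins when more than one matches.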

/Mads
