Trying to Figure Out .hgignore Patterns

Steve Hollasch

Sep 15, 2010, 8:00:20 PM

to merc...@selenic.com

I'm having a heck of a time trying to get simple file patterns to work in my .hgignore file. Suppose I want to ignore all files named "foo", and I have the following in my directory:

  * foo
    foox
    xfoo
  * xyy/foo
    xyy/foox
    xyy/xfoo
  * xyz/foo/foo
    xyz/foo/foox
    xyz/foo/xfoo
  * xyz/foox/foo
    xyz/foox/foox
    xyz/foox/xfoo
  * xyz/xfoo/foo
    xyz/xfoo/foox
    xyz/xfoo/xfoo

Is there a syntax that will allow me to match this? I can't find any rule or set of rules that will yield this pattern. Here are things I've tried:

glob "foo"
    Matches directories named "foo". File globs don't appear to have anything
    that would prevent matches on files with a same-named subdirectory in their
    path.

regexp "/foo$"
    Oddly enough, the terminator $ matches against interior '/' characters.
    Thus, this pattern matches "xyz/foo/foox".

I've read the man page as well as the wiki entry, but I'm just not getting it. Surely there's a way to do this, isn't there?

I wish Mercurial supported ellipsis-style path matching; I believe Perforce uses this for depot specifications. Then the pattern would be a simple ".../foo", and it would just work.

Anyway, am I missing something here? Ignore pattern matching seems very clumsy otherwise.
_______________________________________________
Mercurial mailing list
Merc...@selenic.com
http://selenic.com/mailman/listinfo/mercurial

Steve Hollasch

Sep 15, 2010, 8:21:50 PM

to Andrew J. Leer, merc...@selenic.com

The issue here isn't the regular expression per se, it's the way that Mercurial interprets the ^ and $ tokens. In the case below, the meat of the expression is the fixed string "foo". The problem is that ^ matches the beginning of the path, while $ matches the end of any subdirectory. So, for example, if Mercurial interpreted '$' as the end of the path, I could reliably target leaves by ending the pattern with a '$'. Indeed, that's what a standard regular expression match would yield (such as the tool you mention below). However, the problem with Mercurial is that it doesn't seem to perform a true regular expression match, but instead iterates across multiple paths of successive depth.

For example, if the relative path of a given target is "xyz/foo/foox", and the pattern is "foo$", then the following matches will be tested:

"foo$" in "xyz" (false)
"foo$" in "xyz/foo" (true)
"foo$" in "xyz/foo/foox" (false)

Since the target strings change only in their endings, '^' is consistent from match to match, while '$' is not. Thus, one cannot reliably match against leaves in a path.
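
The prefix-by-prefix testing described above can be sketched in Python. This is only an illustration under stated assumptions: patterns are taken to be applied unrooted (modelled here with re.search), and the helper names prefixes and prefix_results are made up for the example rather than taken from Mercurial.

```python
import re


def prefixes(path):
    """Yield each leading directory prefix of a relative path, then the path itself."""
    parts = path.split("/")
    for depth in range(1, len(parts) + 1):
        yield "/".join(parts[:depth])


def prefix_results(pattern, path):
    """Return (prefix, matched) pairs for every prefix of the path.

    Assumption: the pattern is applied unrooted, which re.search models.
    """
    rx = re.compile(pattern)
    return [(p, bool(rx.search(p))) for p in prefixes(path)]


if __name__ == "__main__":
    for prefix, matched in prefix_results(r"foo$", "xyz/foo/foox"):
        print(f"{prefix!r}: {matched}")
```

Only the middle prefix, "xyz/foo", satisfies the pattern, and that single hit is enough for the whole path to count as ignored.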

-----Original Message-----
From: leea...@gmail.com [mailto:leea...@gmail.com] On Behalf Of Andrew J. Leer
Sent: Wednesday, 15 Sep 2010 17:09
To: Steve Hollasch
Subject: Re: Trying to Figure Out .hgignore Patterns

Use one of the many online Regular Expression Testing tools to figure it out first:

http://www.gskinner.com/RegExr/
http://regexpal.com/

Steve Hollasch

Sep 15, 2010, 8:49:03 PM

to Andrew J. Leer, merc...@selenic.com

Yeah, it pretty much backs up what I suspected. By testing patterns against the OR of a series of subdirectories, Mercurial has lost the ability to distinguish between files and directories. If it didn't perform this convenience trick, the regular expression matching would be much more useful. By that I mean that it could perform every match possible today, plus additional matches that I believe are very important (like matching files and not directories).

I'd hope that Mercurial's .hgignore format would include an option in the future to turn this behavior off and allow full regular expression matching. It would be both less surprising and also more powerful. Alternatively, introduce a new token that means what '$' usually means.

Does anybody know the rationale for this design? Is it to make regular expressions work more like file globs?

-----Original Message-----
From: leea...@gmail.com [mailto:leea...@gmail.com] On Behalf Of Andrew J. Leer
Sent: Wednesday, 15 Sep 2010 17:29
To: Steve Hollasch
Subject: Re: Trying to Figure Out .hgignore Patterns

Try looking at this:

http://stackoverflow.com/questions/1048628/hgignore-syntax-for-ignoring-only-files-not-directories

Mads Kiilerich

Sep 15, 2010, 9:00:13 PM

to Steve Hollasch, merc...@selenic.com

Steve Hollasch wrote, On 09/16/2010 02:21 AM:
> The issue here isn't the regular expression per se, it's the way that Mercurial interprets the ^ and $ tokens. In the case below, the meat of the expression is the fixed string "foo". The problem is that ^ matches the beginning of the path, while $ matches the end of any subdirectory. So, for example, if Mercurial interpreted '$' as the end of the path, I could reliably target leaves by ending the pattern with a '$'. Indeed, that's what a standard regular expression match would yield (such as the tool you mention below). However, the problem with Mercurial is that it doesn't seem to perform a true regular expression match, but instead iterates across multiple paths of successive depth.
>
> For example, if the relative path of a given target is "xyz/foo/foox", and the pattern is "foo$", then the following matches will be tested:
>
> "foo$" in "xyz" (false)
> "foo$" in "xyz/foo" (true)
> "foo$" in "xyz/foo/foox" (false)
>
> Since the target strings change only in their endings, '^' is consistent from match to match, while '$' is not. Thus, one cannot reliably match against leaves in a path.
>

Mercurial works as intended and documented - at least in this case. See
the hgignore man page - http://www.selenic.com/mercurial/hgignore.5.html :

"An untracked file is ignored if its path relative to the repository
root directory, or any prefix path of that path, is matched against any
pattern in .hgignore."

Yes, that means that you can't ignore all files named "foo" without also
ignoring everything in directories called "foo". That is obviously
unfortunate if that is exactly what you want to do, but in most cases
that isn't a problem. Do you _really_ need that?

FWIW I also would prefer an option for using a simple regexp match on
the full paths. (Obviously it should have smart handling of all the
necessary and nice-to-have special cases.) It is theoretically possible
that such a new format could be added to Mercurial - it just requires
that someone comes up with a sufficiently clean design and
implementation and can convince us that it would be worth adding.

But for now I guess
(^|/)foo$
is as good as it gets.

Alternatively you can do something like
find * -name foo -type f -printf '^%p$\n' > .hgignore
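
For readers without GNU find, roughly the same list can be produced from Python. This is a sketch, not a drop-in replacement: the function name anchored_patterns is hypothetical, the .hg directory is skipped explicitly, and unlike the find command it escapes regexp metacharacters in each path.

```python
import os
import re


def anchored_patterns(root, name):
    """Return one fully anchored regexp per regular file called `name` under `root`."""
    patterns = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Mercurial's own metadata directory.
        if ".hg" in dirnames:
            dirnames.remove(".hg")
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            # Keep regular files only, similar in spirit to find's "-type f".
            if filename == name and os.path.isfile(full) and not os.path.islink(full):
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                # Escaping is an addition; the find command writes paths verbatim.
                patterns.append("^" + re.escape(rel) + "$")
    return sorted(patterns)


if __name__ == "__main__":
    print("\n".join(anchored_patterns(".", "foo")))
```

Because every pattern names one complete path, a directory called "foo" is never matched and its other contents stay visible. The list has to be regenerated whenever a new "foo" file appears.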

/Mads

Kevin Bullock

Sep 16, 2010, 12:14:09 PM

to Mads Kiilerich, merc...@selenic.com

On Sep 15, 2010, at 7:00 PM, Steve Hollasch wrote:

regexp "/foo$"
Oddly enough, the terminator $ matches against interior '/' characters.
Thus, this pattern matches "xyz/foo/foox".

On Sep 15, 2010, at 8:00 PM, Mads Kiilerich wrote:

FWIW I also would prefer an option for using a simple regexp match on the full paths. (Obviously it should have smart handling of all the necessary and nice-to-have special cases.) It is theoretically possible that such a new format could be added to Mercurial - it just requires that someone comes up with a sufficiently clean design and implementation and can convince us that it would be worth adding.

I'm not familiar with the internals, but I would guess that the issue is not that '$' is matching interior characters, but rather that when scanning for files to ignore, a directory named 'foo' comes up and gets considered _on its own_. That would be what lets you ignore whole directories.

So rather than add another format, you'd be doing something like adding an additional flag on a pattern to say, "only match regular files". Sounds overly complex to me, given the workarounds presented earlier in the thread.

pacem in terris / mir / shanti / salaam / heiwa
Kevin R. Bullock

Steve Hollasch

Sep 16, 2010, 4:24:21 PM

to Kevin Bullock, merc...@selenic.com

Kevin Bullock <kbullock+...@ringworld.org> wrote:
|
| FWIW I also would prefer an option for using a simple regexp match
| on the full paths. (Obviously it should have smart handling of all
| the necessary and nice-to-have special cases.) It is theoretically
| possible that such a new format could be added to Mercurial - it
| just requires that someone comes up with a sufficiently clean
| design and implementation and can convince us that it would be
| worth adding.
|
| I'm not familiar with the internals, but I would guess that the
| issue is not that '$' is matching interior characters, but rather
| that when scanning for files to ignore, a directory named 'foo'
| comes up and gets considered _on its own_. That would be what lets
| you ignore whole directories.
|
|
| So rather than add another format, you'd be doing something like
| adding an additional flag on a pattern to say, "only match regular
| files". Sounds overly complex to me, given the workarounds
| presented earlier in the thread.

Here's a copy of a message I sent to Mads last night that suggests a
low-impact fix for this. It essentially boils down to what you're
suggesting, but it still allows matching either files or directories
(or both). For my current thinking, you can just skip to the *** below.

Mads Kiilerich <ma...@kiilerich.com> wrote:
| If I were to attack this then I would prefer to not introduce
| another new syntax to Mercurial and the world.

Steve Hollasch <st...@hollasch.net> wrote:
| Well, it's an existing syntax, but it is new to Mercurial, and I
| believe you have a valid point in resisting the addition of yet
| another syntax.

Mads Kiilerich <ma...@kiilerich.com> wrote:
| Note also the main point you stumbled upon: that it also must be
| possible to match directories - if we only match on files then we
| can't do early pruning and would have to recurse down into ignored
| subdirectories for no reason.

Steve Hollasch <st...@hollasch.net> wrote:
| This syntax would match files and directories equally. For example,
| ".../obj/..." would match all files and directories with "obj" in
| their path. ".../obj" would match any file or directory, just as it
| does today, but it would *not* match "a/b/obj/bar".
|
| Regarding early pruning, regular expression and ellipsis-style
| matching are equivalent. The regular expression "abc" requires you to
| traverse the entire tree, just as "...abc" would.

Mads Kiilerich <ma...@kiilerich.com> wrote:
| I notice this syntax is glob-ish. That is probably fine for most
| users, but I would (also) like something more plain regexp-ish. I
| think the notion of ** for matching several directories is more common
| than ...

Steve Hollasch <st...@hollasch.net> wrote:
| I have to agree, I prefer regular expressions myself. You could
| interpret '.' to mean any character but '/', and then introduce a new
| token that meant any character *including* '/'. But I'd be very
| reluctant to start messing with your regular expression syntax like
| that. And on reflection, I think there's a better and less intrusive
| solution.
|
|
| The main problem as I see it is that the current implementation
| changes the meaning of the standard regular expression '$' token. The
| way that Mercurial handles the '$' token today is *not* as an end of
| line (the standard interpretation), but as a wildcard match up to the
| earliest of either the next slash or the end of string.
|
| Put another way, Mercurial effectively replaces the "$" token with
| "($|/)". It's this reinterpretation of the standard regular expression
| token that is causing problems. You could address this by introducing a
| new non-standard token that actually acts like the regular expression
| '$' token, but that brings us back to messing around with regular
| expression syntax.
|
| ***
|
| Perhaps the best solution would be to introduce a new syntax tag that
| effectively treats '$' in the standard manner - something like
| "regexp-std", which interprets '$' as end of string (file path).
|
| After that, the only remaining nit is that you have to contort a bit
| to match file patterns that exist at the root as well as deeper. For
| example, to match file "foo" anywhere in the tree, you'd need to
| specify "(^|/)foo$".
|
| To prune entire sub trees, you do something like "/obj/". Names that
| end with '/' must be directory names, names that end with '$' must be
| file names, and names without either could match both.
|
|
| So anyway, the bottom line is that a new syntax tag that causes
| regular expression '$' tokens to match only against the true end of
| file path would (1) be straight-forward to implement, (2) work with
| existing .hgignore files, (3) leave syntax unchanged, (4) function
| more like true regular expressions, and (5) provide a good deal more
| expressiveness in the pattern matching.

Mads Kiilerich

Sep 16, 2010, 5:20:52 PM

to Steve Hollasch, merc...@selenic.com

Steve Hollasch wrote, On 09/16/2010 10:24 PM:
> | The main problem as I see it is that the current implementation
> | changes the meaning of the standard regular expression '$' token. The
> | way that Mercurial handles the '$' token today is *not* as an end of
> | line (the standard interpretation), but as a wildcard match up to the
> | earliest of either the next slash or the end of string.
> |
> | Put another way, Mercurial effectively replaces the "$" token with
> | "($|/)". It's this reinterpretation of the standard regular expression
> | token that is causing problems. You could address this by introducing a
> | new non-standard token that actually acts like the regular expression
> | '$' token, but that brings us back to messing around with regular
> | expression syntax.

Please understand that Mercurial _does_ use standard regular expressions
and doesn't change anything. Your description of what you think Mercurial
does seems to be based on a misunderstanding of _what_ strings Mercurial
applies the regexps to. _That_ is the main problem ;-)

Listen: Mercurial applies the ignore regexps both to the directory
paths it encounters while it traverses the working directory and to the
files it meets.

Or in other words: An untracked file is ignored if its path relative to
the repository root directory, or any prefix path of that path, is
matched against any pattern in .hgignore.

A (slightly unfortunate, IMHO) consequence of that is that it isn't
possible to create a pattern that distinguishes directories from files.

I would have preferred if the code was something like this:

--- a/mercurial/dirstate.py
+++ b/mercurial/dirstate.py
@@ -569,7 +569,7 @@
                     nf = normalize(nd and (nd + "/" + f) or f, True)
                     if nf not in results:
                         if kind == dirkind:
-                            if not ignore(nf):
+                            if not ignore(nf + '/'):
                                 match.dir(nf)
                                 wadd(nf)
                             if nf in dmap and matchfn(nf):

BUT that is not how it is, and changing it now will not be backward
compatible and is thus probably not an option.
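
The effect of that one-line change can be illustrated outside Mercurial. The sketch below is hypothetical: candidate and matches are invented names, and re.search stands in for however the ignore patterns are really applied. It only shows why a trailing '/' on directory paths would let a pattern tell the two kinds apart.

```python
import re


def candidate(path, is_dir):
    """Hypothetical variant: directory paths are tested with a trailing '/'."""
    return path + "/" if is_dir else path


def matches(pattern, path, is_dir):
    """Test a pattern against the candidate string for a file or a directory."""
    return re.search(pattern, candidate(path, is_dir)) is not None


if __name__ == "__main__":
    for pattern in (r"(^|/)foo$", r"(^|/)foo/", r"(^|/)foo"):
        print(pattern,
              "file:", matches(pattern, "xyz/foo", is_dir=False),
              "dir:", matches(pattern, "xyz/foo", is_dir=True))
```

Under this scheme a pattern ending in '$' can only hit files, one ending in '/' can only hit directories, and one with neither can hit both.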

And note again: The "issue" here is not so much the syntax of the
patterns but what strings they are tested on. A new pattern syntax is
thus not the solution.

I think it would be great if you could propose a patch to
doc/hgignore.5.txt to make it more clear.

/Mads

Steve Hollasch

Sep 16, 2010, 6:08:32 PM

to Mads Kiilerich, merc...@selenic.com

Steve Hollasch wrote, On 09/16/2010 10:24 PM:
| The main problem as I see it is that the current implementation
| changes the meaning of the standard regular expression '$' token.
| The way that Mercurial handles the '$' token today is *not* as an
| end of line (the standard interpretation), but as a wildcard match
| up to the earliest of either the next slash or the end of string.
|
| Put another way, Mercurial effectively replaces the "$" token with
| "($|/)". It's this reinterpretation of the standard regular
| expression token that is causing problems. You could address this by
| introducing a new non-standard token that actually acts like the
| regular expression '$' token, but that brings us back to messing
| around with regular expression syntax.

Mads Kiilerich <ma...@kiilerich.com> wrote:
| Please understand that Mercurial _does_ use standard regular expressions
| and doesn't change anything. Your description of what you think
| Mercurial does seems to be based on a misunderstanding of _what_
| strings Mercurial applies the regexps to. _That_ is the main problem
| ;-)
|
| Listen: Mercurial applies the ignore regexps both to the directory
| paths it encounters while it traverses the working directory and to
| the files it meets.
|
| Or in other words: An untracked file is ignored if its path relative
| to the repository root directory, or any prefix path of that path, is
| matched against any pattern in .hgignore.
|
| A (slightly unfortunate, IMHO) consequence of that is that it isn't
| possible to create a pattern that distinguishes directories from
| files.

Sorry, I guess I wasn't clear enough. When I said that Mercurial
*effectively* interprets the '$' token differently, I was trying to
make the point that there are two *equivalent* ways of describing
Mercurial pattern matching.

Most users' mental model is that they have a relative file path and a
given .hgignore file pattern. If they just use the pattern against the
file path, they'll encounter situations where they're surprised by the
results, because that's not how Mercurial performs the match.

Consider a pattern "/foo$" and a file path "a/b/c/d/e/foo/bar".

The standard way of describing the current match behavior is to say
that Mercurial performs the OR of the following matches:

"/foo$" against "a" // false
"/foo$" against "a/b" // false
"/foo$" against "a/b/c" // false
"/foo$" against "a/b/c/d" // false
"/foo$" against "a/b/c/d/e" // false
"/foo$" against "a/b/c/d/e/foo" // true
"/foo$" against "a/b/c/d/e/foo/bar" // false

I argue that this is EXACTLY EQUIVALENT to saying that Mercurial first
substitutes '$' in the pattern with '($|/)', and then performs the
following SINGLE match:

"/foo($|/)" against "a/b/c/d/e/foo/bar" // true

In your model, there's no reinterpretation of the regular expression.
In my model, there is. However, both ways yield identical results. The
model I illustrate above reveals multiple unfortunate side effect
limitations of the current implementation (*effectively*, the loss of
the '$' operator when matched only against the set of all file paths
in the tree).
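
The claimed equivalence is easy to spot-check. The sketch below compares the two descriptions on the paths from the start of this thread. It assumes patterns are applied unrooted (re.search) and that '$' appears only as the end anchor, since the substitution is a plain text replace. The function names are made up for the example.

```python
import re


def prefixes(path):
    """Yield each leading directory prefix of a relative path, then the path itself."""
    parts = path.split("/")
    for depth in range(1, len(parts) + 1):
        yield "/".join(parts[:depth])


def ignored_by_prefix_or(pattern, path):
    """Model 1: OR of the unmodified pattern over every prefix of the path."""
    rx = re.compile(pattern)
    return any(rx.search(p) for p in prefixes(path))


def ignored_by_single_match(pattern, path):
    """Model 2: rewrite '$' as '($|/)' and match once against the full path."""
    rewritten = pattern.replace("$", "($|/)")  # naive: assumes '$' is only an anchor
    return re.search(rewritten, path) is not None


PATHS = [
    "foo", "foox", "xfoo", "xyy/foo", "xyy/foox", "xyy/xfoo",
    "xyz/foo/foo", "xyz/foo/foox", "xyz/foo/xfoo",
    "xyz/foox/foo", "xyz/foox/foox", "xyz/foox/xfoo",
    "xyz/xfoo/foo", "xyz/xfoo/foox", "xyz/xfoo/xfoo",
    "a/b/c/d/e/foo/bar",
]

if __name__ == "__main__":
    for pattern in (r"/foo$", r"foo$", r"(^|/)foo$"):
        agree = all(
            ignored_by_prefix_or(pattern, p) == ignored_by_single_match(pattern, p)
            for p in PATHS
        )
        print(pattern, "models agree:", agree)
```

Agreement on a handful of paths is not a proof, but it makes the two mental models concrete enough to experiment with.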

Mads Kiilerich <ma...@kiilerich.com> wrote:
| I would have preferred if the code was something like this:
|
| --- a/mercurial/dirstate.py
| +++ b/mercurial/dirstate.py
| @@ -569,7 +569,7 @@
| nf = normalize(nd and (nd + "/" + f) or f, True)
| if nf not in results:
| if kind == dirkind:
| - if not ignore(nf):
| + if not ignore(nf + '/'):
| match.dir(nf)
| wadd(nf)
| if nf in dmap and matchfn(nf):
|
| BUT that is not how it is, and changing it now will not be backward
| compatible and is thus probably not an option.

Adding a new syntax variant of 'regexp' will be fully backward
compatible. Or do you mean that old clients will barf on new .hgignore
files?

Mads Kiilerich <ma...@kiilerich.com> wrote:
| And note again: The "issue" here is not so much the syntax of the
| patterns but what strings they are tested on. A new pattern syntax is
| thus not the solution.

Yes, we're in complete agreement on both points, although I AM
proposing a new syntax NAME in order to control the set of match
candidates.

Mads Kiilerich <ma...@kiilerich.com> wrote:
| I think it would be great if you could propose a patch to
| doc/hgignore.5.txt to make it more clear.

Sure -- where would I submit the proposed changes?

Martin Geisler

Sep 16, 2010, 6:26:21 PM

to Steve Hollasch, merc...@selenic.com

Steve Hollasch <stev...@microsoft.com> writes:

Everybody -- please keep the discussion on the list and use a standard
quoting style.

> Consider a pattern "/foo$" and a file path "a/b/c/d/e/foo/bar".
>
> The standard way of describing the current match behavior is to say
> that Mercurial performs the OR of the following matches:
>
> "/foo$" against "a" // false
> "/foo$" against "a/b" // false
> "/foo$" against "a/b/c" // false
> "/foo$" against "a/b/c/d" // false
> "/foo$" against "a/b/c/d/e" // false
> "/foo$" against "a/b/c/d/e/foo" // true
> "/foo$" against "a/b/c/d/e/foo/bar" // false

It does not make a difference for the argument you are making here, but
I just want to make it clear that the final test is not done since the
OR short-circuits the testing when it finds a matching prefix.

So Mercurial will never enter the 'foo' directory, in order to save time
stat'ing lots of files which are to be ignored, and it will thus never
encounter the full filename.
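
A small sketch of that short-circuit, again assuming unrooted matching modelled with re.search and using an invented helper name: the loop stops at the first prefix that matches, so deeper strings are never tested.

```python
import re


def first_matching_prefix(pattern, path):
    """Return (prefix, tests_run), stopping at the first ignored prefix.

    Returns (None, tests_run) when no prefix of the path matches.
    """
    rx = re.compile(pattern)
    parts = path.split("/")
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        if rx.search(prefix):
            return prefix, depth  # deeper prefixes are never examined
    return None, len(parts)


if __name__ == "__main__":
    print(first_matching_prefix(r"/foo$", "a/b/c/d/e/foo/bar"))
```

For the example path, six of the seven possible tests run; the seventh, against the full filename, is skipped.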

> I argue that this is EXACTLY EQUIVALENT to saying that Mercurial first
> substitutes '$' in the pattern with '($|/)', and then performs the
> following SINGLE match:
>
> "/foo($|/)" against "a/b/c/d/e/foo/bar" // true
>
> In your model, there's no reinterpretation of the regular expression.
> In my model, there is. However, both ways yield identical results.

Agreed.

> The model I illustrate above reveals multiple unfortunate side effect
> limitations of the current implementation (*effectively*, the loss of
> the '$' operator when matched only against the set of all file paths
> in the tree).

Yes, the '$' operator is useless in our regular expressions.

--
Martin Geisler

Mercurial links: http://mercurial.ch/

Steve Hollasch

Sep 16, 2010, 6:27:24 PM

to Danek Duvall, Mads Kiilerich, merc...@selenic.com

Mads Kiilerich wrote:
| Or in other words: An untracked file is ignored if its path relative
| to the repository root directory, or any prefix path of that path, is
| matched against any pattern in .hgignore.

Danek Duvall wrote:
| I guess that's a good reason why it's not possible to ignore a
| directory based on a particular filename it might contain. I was
| trying to do something like
|
| ^.*(?=/TotalChangedLines)
|
| to ignore any directory that might have a file "TotalChangedLines" in
| it, with no success. From your description, there seems to be no way
| to do that. I was also hoping to have an expression to eliminate all
| .tar.gz files as well as their unpacked directories, assuming the
| latter had the same name as the former, just with the ".tar.gz"
| extensions stripped off, but I assume that again, there's no way to do
| that.
|
| Am I the only one crazy enough to want something like this?

Thanks, that's another great example of the true power of matching the
regular expression against the set of all file paths in the tree.

I believe it's significant that your mental model is that you're
matching against this set, rather than against a particular set of
fragments of all file paths in the tree.

I believe that the current scheme misses the common mental model
(which surprises users), and reduces the power of regular expression
matching.

By the way, while I'm having fun with regular expression equivalences,
here's how you get equivalence from the current implementation of file
globs. Take the file glob string and

(1) Prefix the pattern with '(^|/)'
(2) Postfix the pattern with '($|/)'
(3) Replace '?' with '.'
(4) Replace '*' with '.*'

Now match the resulting pattern against the set of relative file paths
in your tree. Again, this is NOT how Mercurial implements the
matching, but it is EQUIVALENT.

Thus, WHEN YOUR MODEL IS THE SET OF ALL FILE PATHS IN YOUR TREE,
Mercurial file glob expressions are a subset of Mercurial regular
expressions, which are themselves a subset of standard regular
expressions.
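
The four steps can be written down directly. This sketch follows the steps exactly as listed in this message. It is not Mercurial's own glob translation, which may treat '*' and '/' differently. Escaping of the remaining characters is an addition, and glob_to_regex is a made-up name.

```python
import re


def glob_to_regex(glob):
    """Translate a glob into a regexp using the four steps listed above."""
    body = []
    for ch in glob:
        if ch == "?":
            body.append(".")      # step 3
        elif ch == "*":
            body.append(".*")     # step 4
        else:
            body.append(re.escape(ch))  # addition: keep other characters literal
    # Steps 1 and 2: anchor both ends at a path-component boundary.
    return "(^|/)" + "".join(body) + "($|/)"


if __name__ == "__main__":
    rx = re.compile(glob_to_regex("foo"))
    for path in ("foo", "foox", "xfoo", "xyz/foo/foox"):
        print(path, bool(rx.search(path)))
```

Matching the result against complete relative paths reproduces the behaviour reported at the top of the thread: the glob "foo" hits "xyz/foo/foox" through its "foo" directory, but not "foox" or "xfoo".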

Danek Duvall

Sep 16, 2010, 5:44:18 PM

to Mads Kiilerich, merc...@selenic.com

Mads Kiilerich wrote:

> Or in other words: An untracked file is ignored if its path relative
> to the repository root directory, or any prefix path of that path, is
> matched against any pattern in .hgignore.

I guess that's a good reason why it's not possible to ignore a directory
based on a particular filename it might contain. I was trying to do
something like

^.*(?=/TotalChangedLines)

to ignore any directory that might have a file "TotalChangedLines" in it,
with no success. From your description, there seems to be no way to do
that. I was also hoping to have an expression to eliminate all .tar.gz
files as well as their unpacked directories, assuming the latter had the
same name as the former, just with the ".tar.gz" extensions stripped off,
but I assume that again, there's no way to do that.

Am I the only one crazy enough to want something like this?
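
Why the lookahead cannot work follows from which strings get tested. In the sketch below (unrooted matching via re.search is assumed, and the directory name "build" is invented for the example), the directory's own path never has "/TotalChangedLines" ahead of it, so only the marker file's full path matches.

```python
import re

PATTERN = re.compile(r"^.*(?=/TotalChangedLines)")


def prefixes(path):
    """Yield each leading directory prefix of a relative path, then the path itself."""
    parts = path.split("/")
    for depth in range(1, len(parts) + 1):
        yield "/".join(parts[:depth])


def matching_prefixes(path):
    """Return the prefixes of the path that the lookahead pattern matches."""
    return [p for p in prefixes(path) if PATTERN.search(p)]


if __name__ == "__main__":
    for path in ("build/TotalChangedLines", "build/other.txt"):
        print(path, matching_prefixes(path))
```

The marker file itself ends up ignored, but its siblings do not, because the string "build" is tested in isolation.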

Danek

Tony Mechelynck

Sep 16, 2010, 7:43:13 PM

to Danek Duvall, merc...@selenic.com

Unpack them into some directory that is already ignored, or else,
outside the repo (e.g. in the parent of the "repo root" directory
containing your .hg directory). Similarly, the "object" directories
where the "make" program has my compilers write their linkable output,
are mentioned in .hgignore so Mercurial won't try to track any changes
in those object files.

Alternately, whenever you untar or unzip an archive into some
subdirectory of your repo, commit a change to your .hgignore listing
that new directory. But that might give you problems if you push from
that repo, or (which is equivalent) if someone pulls from it (they won't
get the ignored dir, but they will get the .hgignore change).

Oh, I forgot: you may have "custom" ignore files: see "man hgrc", in the
[ui] section, the "ignore" item. I suppose you might arrange to have
these custom ignore files not be tracked for changes, so they would be
hidden to outgoing transactions.

Or if your directories "to be ignored" have some kind of regularity to
their names, use wildcard-globs or well-chosen regexps to match that
regularity. But I guess you had already thought of that.

Best regards,
Tony.
--
ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.
King of all Britons, defeator of the Saxons, sovereign of all
England!
[Pause]
SOLDIER: Get away!
"Monty Python and the Holy Grail" PYTHON (MONTY)
PICTURES LTD

Danek Duvall

Sep 16, 2010, 7:55:35 PM

to Tony Mechelynck, merc...@selenic.com

Tony Mechelynck wrote:

> Unpack them into some directory that is already ignored, or else, outside
> the repo (e.g. in the parent of the "repo root" directory containing your
> .hg directory).

Yeah, that would work, too. Not sure why I hadn't thought of that for the
archive unpacking case, though I know for the other one, I specifically
wanted them directly in the root of the repo, but that's probably my own
personal quirk.

> Oh, I forgot: you may have "custom" ignore files: see "man hgrc", in the
> [ui] section, the "ignore" item. I suppose you might arrange to have
> these custom ignore files not be tracked for changes, so they would be
> hidden to outgoing transactions.

Yeah, but until the paths to the custom ignore files are relative to the
repo root, they're not terribly useful, at least if you ever expect the
repo to move, or be accessed from different directories (/net/machine/...).
I started implementing this, and had the unpack targets write
$REPO/.localignore, but then I had no good way of referencing it.

> Or if your directories "to be ignored" have some kind of regularity to
> their names, use wildcard-globs or well-chosen regexps to match that
> regularity. But I guess you had already thought of that.

Indeed.

Thanks,

Matt Mackall

Sep 16, 2010, 9:10:10 PM

to Steve Hollasch, merc...@selenic.com

Well, no, it's not. If the skipped directory has 10000 files in it,
you've now just avoided 20000 system calls and 10000 matches. Which is
precisely why we match against files AND directories when filtering.

--
Mathematics is the supreme nostalgia of our time.

Steve Hollasch

Sep 16, 2010, 9:36:22 PM

to Matt Mackall, merc...@selenic.com

Matt Mackall wrote:
> Well, no, it's not. If the skipped directory has 10000 files in it,
> you've now just avoided 20000 system calls and 10000 matches. Which is
> precisely why we match against files AND directories when filtering.

Yes, it is. :)

Let's address this, since I've seen it brought up several times, and I believe
it's a red herring.

First of all, there's no reason to slavishly implement the mental model whole
cloth. As I mentioned above, when discussing pattern matching, it's useful to
discuss the effect of various patterns without getting caught up in
implementation details.

Certainly, I could enumerate every single file in my tree, and then iterate over
my ignore patterns to exclude matches. There's nothing in my proposed solution,
though, that mandates this inefficient implementation.

Today, if you don't have directory name exclusions (or if you have an empty
.hgignore file), Mercurial is forced to enumerate the entire tree. Hopefully,
it's fast enough to handle these simple cases. Similarly, if your ignored
directories tend to be near the leaves (such as "obj" and other generated file
directories), then the savings of pruning your tree descent are minimal.
Forcing exclusions intended for files to also inadvertently match internal
directories isn't really a great way to get a performance gain, and if they
*don't* also match any internal directories, are a performance *hit*, as
they'll be matched against every single directory in the tree for no reason.

All that said, there's absolutely nothing about the changes I'm suggesting that
preclude subdirectory pruning -- all they do is to make the pattern intent
clear. If I have a rule to ignore "/obj/", this match will succeed as soon as
you descend to an "obj" directory, and you can safely abort your descent there.
If, however, I have a rule like "foo$", I'm simply telling you that I don't
*want* Mercurial to match this against subdirectory names.

Because the matching I'm suggesting is a pure superset of the matching as
currently implemented by Mercurial, there is no efficient pruning technique in
the current approach that cannot analogously be implemented in my proposed
approach.

Tony Mechelynck

Sep 17, 2010, 10:19:05 PM

to Steve Hollasch, merc...@selenic.com

On 17/09/10 03:36, Steve Hollasch wrote:
> On Thu, 2010-09-16 at 22:08 +0000, Steve Hollasch wrote:
>>> Steve Hollasch wrote, On 09/16/2010 10:24 PM:

[...]

Even if the ignored directories have few levels below them ("they are
near the leaves"), the savings are still important if there are many
files ("leaves") in them. Reread what was said above: if one directory
is ignored, then if it contains ten thousand files, those ten thousand
directory entries are not even looked at.

>
> All that said, there's absolutely nothing about the changes I'm suggesting that
> preclude subdirectory pruning -- all they do is to make the pattern intent
> clear. If I have a rule to ignore "/obj/", this match will succeed as soon as
> you descend to an "obj" directory, and you can safely abort your descent there.
> If, however, I have a rule like "foo$", I'm simply telling you that I don't
> *want* Mercurial to match this against subdirectory names.

[...]

Regardless of what you want, I gather from how all this thread has been
going that Mercurial is not going to change its behaviour:

* Starting at the repository top (the parent of the .hg directory),
every directory or file name in turn (but see below for what I mean by
"every name") is matched against every pattern in .hgignore. I gather
that you would want that matching to somehow distinguish files from
directories, but it doesn't; and the $ at the end of a regexp would be
matched against the end of a directory name just like it would against
the end of a file name (e.g. if one pattern is "foo$" [without the
quotes of course] any directory or file whose name ends in foo will be
ignored, but any directory or file named foobar won't, unless it matches
some other pattern: thus, contrary to what was said in some earlier
post, the regexp $ _can_ be useful in some cases).

* If a directory matches, that directory is ignored *with all its
contents at any depth* and Mercurial behaves as if that directory didn't
exist: none of its contents are even looked at, they are already ignored
by virtue of being in an ignored directory tree branch.

* Therefore, for any file that you want tracked, you must make sure that
no directory in its relative path from .hg/.. is matched by any
.hgignore pattern. How to construct your hgignore with this in mind will
of necessity vary from one repository to the next, but I believe that in
the vast majority of cases it should be possible (though not necessarily
"elegant", where elegance is measured by conciseness and paucity of rules).
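
The three points above can be sketched in Python. This is only an illustration of the behaviour as described in this thread, not Mercurial's actual code: the `walk` function, the nested-dict stand-in for a working directory, and the sample pattern are all assumptions made for the example.

```python
import re

def walk(tree, patterns, prefix=""):
    """Yield the paths of files that are NOT ignored.

    `tree` is a nested dict standing in for a working directory:
    a dict value is a directory, None is a file.
    """
    for name, child in sorted(tree.items()):
        path = prefix + name
        # Files and directories are tested identically, with no trailing
        # slash, so "$" can match at the end of a directory name too.
        if any(p.search(path) for p in patterns):
            continue  # ignored; a matched directory is never descended into
        if isinstance(child, dict):
            yield from walk(child, patterns, path + "/")
        else:
            yield path

tree = {
    "foo": None,
    "foox": None,
    "xyz": {
        "foo": {"foo": None, "foox": None},
        "foox": {"foo": None, "foox": None},
    },
}
patterns = [re.compile(r"(^|/)foo$")]
tracked = list(walk(tree, patterns))
# tracked == ["foox", "xyz/foox/foox"]
```

Note that xyz/foo/foox is missing from the result: the directory xyz/foo matched the pattern, so nothing below it was ever examined.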

Best regards,
Tony.
--
"If once a man indulges himself in murder, very soon he comes to think
little of robbing; and from robbing he next comes to drinking and
Sabbath-breaking, and from that to incivility and procrastination."
-- Thomas De Quincey (1785 - 1859)

Steve Hollasch

Sep 20, 2010, 1:17:48 PM9/20/10

to Tony Mechelynck, merc...@selenic.com

From: Tony Mechelynck [mailto:antoine.m...@gmail.com]

For every source tree, there is a set of files that the user wants to track, and
a set of files that they want ignored. That's it -- Mercurial isn't free to
"discover" some new set of files that it can ignore in the interest of
efficiency. It must exactly honor the configuration set forth by the user. If
the user wants everything below a particular directory to be ignored, then both
the current approach and the interpretation I'm proposing provide a way to do
that; the current approach has no advantage here (as I demonstrated above).

Making it difficult for users to express the set that can be ignored does not
help make things more efficient. If someone wants Mercurial to dive into a
directory, then it must do so. If the user specifies a directory to skip, then
Mercurial must do so. It's as simple as that.

Today the '$' token is interpreted in what several users (myself included)
consider to be an odd way, and in a way that makes inadvertent (and incorrect)
matching easy to do. If I want Mercurial to ignore files that match a pattern,
but not directories, then Mercurial hasn't saved any work by stopping at the
directory. The user will simply wrestle with the ignore file until Mercurial
delves into the subdirectory anyway. The workaround, ironically enough, is more
verbose and clumsy than the user would like, and MORE WORK for Mercurial to
process, since the ruleset must now include a brittle set of additional rules
that it will have to evaluate for every fraction of every directory in the tree.
This is not what I consider an optimization. No user ever walks away from a
situation where Mercurial ignores more files than intended.

Getting back again to the item of efficiency, the interpretation I'm proposing
is _more_ efficient, rather than less. The current approach tests every pattern
against every fraction of every directory in the tree. In my proposed approach,
if the pattern ends in '$', then test it only against files in the tree, never against
subdirectories. It's both more expressive and more efficient.
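
A minimal Python sketch of the proposed interpretation, under stated assumptions: `walk_proposed`, the nested-dict tree, and the two sample patterns are hypothetical, and the check for a trailing '$' is deliberately naive (it does not handle an escaped `\$`). It is not a patch against Mercurial.

```python
import re

def walk_proposed(tree, patterns, prefix=""):
    """Yield non-ignored file paths under the proposed '$' rule.

    `tree` is a nested dict: a dict value is a directory, None is a file.
    A pattern ending in '$' is tested against files only.
    """
    for name, child in sorted(tree.items()):
        path = prefix + name
        is_dir = isinstance(child, dict)
        ignored = any(
            re.search(p, path)
            for p in patterns
            # Skip '$'-terminated patterns when looking at a directory.
            if not (is_dir and p.endswith("$"))
        )
        if ignored:
            continue  # an ignored directory is still pruned as a whole
        if is_dir:
            yield from walk_proposed(child, patterns, path + "/")
        else:
            yield path

tree = {
    "foo": None,
    "obj": {"a.o": None},
    "xyz": {"foo": {"foo": None, "foox": None}},
}
patterns = [r"(^|/)foo$", r"(^|/)obj(/|$)"]
tracked = list(walk_proposed(tree, patterns))
# tracked == ["xyz/foo/foox"]
```

Here the obj directory is still pruned without looking at its contents, while the directory xyz/foo is entered because the '$' pattern is never tried against it.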

Martin Geisler

Sep 20, 2010, 2:18:05 PM9/20/10

to Steve Hollasch, merc...@selenic.com

Hi,

To move forward, we need less text and more code... :)

As I understand it, the problem is that Mercurial does not distinguish between files and directories. I think it would be very elegant to do so by including the final slash in the directory name - I think Mads suggested this already.

Please make a patch which we can review. The change will need to introduce a new syntax keyword in order to not break existing .hgignore files.
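
One way to picture the trailing-slash idea, as a rough Python sketch: the helper names `candidate` and `matches` and the sample patterns are made up for illustration, and nothing here reflects an existing Mercurial syntax or API.

```python
import re

def candidate(path, is_dir):
    """Return the string a pattern would be tested against.

    Hypothetical rule: directory names carry a final slash, file names
    do not.
    """
    return path + "/" if is_dir else path

def matches(pattern, path, is_dir):
    return re.search(pattern, candidate(path, is_dir)) is not None

# "foo$" can now only hit files, because a directory ends in "/".
file_hit = matches(r"(^|/)foo$", "xyz/foo", is_dir=False)   # True
dir_miss = matches(r"(^|/)foo$", "xyz/foo", is_dir=True)    # False

# A pattern ending in "/" can only hit directories.
dir_hit = matches(r"(^|/)obj/$", "src/obj", is_dir=True)    # True
file_miss = matches(r"(^|/)obj/$", "src/obj", is_dir=False) # False
```

With this convention the pattern text alone says whether a rule targets files, directories, or both, and a matched directory can still be pruned without descending into it.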

--
Martin Geisler

Steve Hollasch

Sep 20, 2010, 5:11:29 PM9/20/10

to Martin Geisler, merc...@selenic.com

I couldn’t agree more (assuming there are still people out there who haven’t put me on their kill list).

Is there a URL you can send that will help get me started on developing a patch?

Thanks.

Adrian Buehlmann

Sep 21, 2010, 4:55:21 AM9/21/10

to Steve Hollasch, merc...@selenic.com

On 20.09.2010 23:11, Steve Hollasch wrote:
> I couldn’t agree more (assuming there are still people out there who
> haven’t put me on their kill list).

FWIW, I liked reading your reasoning. Moderately annoyed for seeing top
posting (see how others post to this list).

BTW, are you working on the CodePlex team? (just curious).

> Is there a URL you can send that will help get me started on developing a patch?

Welcome to the club :)

I don't know your level of understanding about doing code contributions.
So I'll resort to referring you to the wiki start page at

http://mercurial.selenic.com/wiki/Mercurial

Under the heading "Further information" there is an item "DeveloperInfo
for Mercurial hackers". Clicking on "DeveloperInfo" will bring you to

http://mercurial.selenic.com/wiki/DeveloperInfo

The wiki has a very good search function that lets you search both
page titles and text content. It is available at the top right of each page.

BTW, please be bold and edit the wiki if you find something that needs
improving (editing the wiki requires a login for yourself, just create one).

A good way of getting quick responses in some cases is using the irc
channel as mentioned on the wiki start page (#mercurial on
irc.freenode.net). And subscribe to the devel list.

The most important advice: don't give up and be patient.

Adrian Buehlmann

Sep 21, 2010, 6:02:05 PM9/21/10

to Steve Hollasch, Merc...@selenic.com

(Setting cc back to list and forwarding)

-------- Original Message --------
Subject: RE: Trying to Figure Out .hgignore Patterns
Date: Tue, 21 Sep 2010 21:49:23 +0000
From: Steve Hollasch <stev...@microsoft.com>
To: Adrian Buehlmann <adr...@cadifra.com>

From: Adrian Buehlmann [mailto:adr...@cadifra.com]
> On 20.09.2010 23:11, Steve Hollasch wrote:
> > I couldn't agree more (assuming there are still people out there who
> > haven't put me on their kill list).

> FWIW, I liked reading your reasoning. Moderately annoyed for seeing top
> posting (see how others post to this list).

Ha! You'll notice that when I remember, I try to use the old-fashioned Unix
style quoting. It's funny, it takes me back to 1980 composing my mail this way.
What I have to do now is copy the entire contents, fire up Vim, and try to
recreate bottom posting. It's taken me decades to slowly change my ways (I used
to hate top quoting, "HTML" mail, web page styling, anything over 80 columns,
IDE projects, and so forth), and now I have to reach back to do something as simple
as reformat mail response style, and I feel thirty years younger. :)

I'm trying, man, I'm trying. :)

> BTW, are you working on the CodePlex team? (just curious).

No, I work in Live Labs (http://livelabs.com/). I keep meaning to start doing
open source work myself, but it's on The List, if you know what I mean.

> > Is there a URL you can send that will help get me started on developing a patch?

> Welcome to the club :)
>

> ...

Thank you very much for the pointers -- I'm hoping to dig into it shortly. It'll
be my excuse to start learning Python, as soon as I'm sure it won't collide with
everything else I'm learning this year. Seems like a cool language. It might
even start to wean me from my beloved Perl.

I also plan on fleshing out the .hgignore page a bit more, and I'll also see
what I can add to the wiki.

Thanks again for the tips!

Tony Mechelynck

Sep 23, 2010, 10:08:10 PM9/23/10

to Adrian Buehlmann, merc...@selenic.com

Yeah, right. In most mutual-help channels (not only #mercurial and not
only on freenode), the routine goes as follows:

1. Don't ask if you may ask, ask straightaway.
2. Stay in the channel, several hours if necessary, maybe one of the
"sleepers" will wake up and reply even after that long a time,
especially if the talk in the channel isn't sweeping past very fast.
3. While you're waiting for your IRC client to tell you (e.g. by
flashing its taskbar icon) that your nick has been said, don't stay
idle: investigate other sources of help as they come to your mind.
4. While you wait in the channel, if you see someone asking a question
to which you know the answer, don't hesitate to answer it (starting with
the nick of the asker).

Best regards,
Tony.
--
SECOND SOLDIER: It could be carried by an African swallow!
FIRST SOLDIER: Oh yes! An African swallow maybe ... but not a European
swallow. that's my point.

"Monty Python and the Holy Grail" PYTHON (MONTY)
PICTURES LTD
