New -w

21 views
Skip to first unread message

Andy Lester

unread,
Mar 8, 2017, 2:30:14 PM3/8/17
to ack development
This behavior has been part of ack since at least 2.00.

218 if ( $opt->{w} ) {
1 $str = "\\b$str" if $str =~ /^\w/;
2 $str = "$str\\b" if $str =~ /\w$/;
3 }

which morphed into this:

318 if ( $opt->{w} ) {
1 my $pristine_str = $str;
2
3 $str = "(?:$str)";
4 $str = "\\b$str" if $pristine_str =~ /^\w/;
5 $str = "$str\\b" if $pristine_str =~ /\w$/;
6 }

I'm pulling out on the word branch, replacing it with this:

318 if ( $opt->{w} ) {
1 $str = "\\b(?:$str)\\b";
2 }

This changes some test cases in ack-w.t, and it may have some changes that you might not like, maybe.

ack -w "Sue!" will match "Sue!Foo" but not "Sue!*", because there's not a word boundary in the middle of "!*". I'm OK with that. Previously "ack -w Sue!" would match both "Sue!Foo" and "Sue!*".

The key is that it fixes "ack -w (sue)" matching "pursue", which the user cannot have wanted.

Anyone see downsides? Is there something we can use better than \b ?

--
Andy Lester => www.petdance.com

Bill Ricker

unread,
Mar 8, 2017, 3:44:37 PM3/8/17
to Andy Lester, ack development

On Wed, Mar 8, 2017 at 2:30 PM, Andy Lester <an...@petdance.com> wrote:
Anyone see downsides?  Is there something we can use better than \b ?

Yes.

Important edgecases (needing test cases) are
  * non-word char in text matching start (or end) of ​pattern
     * at first (or last) position in line
     * to match just after ( or before) whitespace.
     * (irrespective of how matched, i.e. pattern having explicit ' !'; or class '\W' or [:graph:];  or wild '.+' )

​\b is defined as 0-width assertion matching between the chars of pairs
that would match either \w\W or \W\w, (*)
"counting the imaginary characters off the beginning and end of the string as matching a \W ." ​
(emphasis supplied)
This covers the edgecases for ack -w 'Fred'  but not for ack -w 'Sue!' .

(*) pedantically matching lookarounds (?x: (?<= \w) (?= \W) |  (?<=\W) (?=\w) ) ) 

i believe your fix  is broken for pattern 'Sue!'  matching line "Sue!",  since end of line $ is treated as \W so there is NOT a \b match at end of line but rather \W(\W) imputed. So simple \b wrap won't work with terminal \W match though it does for a  pattern ending an a \w char.

[iirc the broken patch you are reverting was to fix that case ... and made !Sue! work but broke (Fred|Sue) iirc ? ]

I expect your simplified \b$PATTERN\b is also broken matching 'Sue!' against "Sue!  Fred!  "  since \s is a subset of ( \W  SET_MINUS  [:graph:]  )  and thus "Sue!  Fred" has \W\W where \b is wanted, will not match.

I think this asymmetry of when m{\b\W\b} matches vs m{\b\w\b} does  to be an annoying lack of foresight in original definition of \b.  It's defined re \w and \W without considering [:graph:][:space:] to be a \b in the case when the [:graph:] is not also [:word:], and without considering ^ and $ to be the opposite of everything as they surely should be.

[I'm opinionated aren't i?  Note i say lack of foresight, not anything harsher. Hindsight is always clearer. I don't claim this is a bug in PerlRE, just a case where DWIM doesn't. ]

the possible improvement to workaround \b's asymmetry that i can think of  is
 (?<=^|\b|\s)(?:$PATTERN)(?=\s|\b|$)​    # obviously double the \\ for the "" if not using qr{}

​the ^|\s and ​ \s|$ are redundant but harmless (*) additions to \b when pattern matches only leading and trailing alpha_num (\w), but fixes the assymetry of \b when either/both start and end lexemes of match can be \W.

(*) harmless: assumes RE compiler is sane and optimizes helpfully.
Should Benchmark against known buggy version \b(?:$PATTERN)\b with much normal text to see how bad it harms the normal case; 
and against hand optimizations :
   ​​ (?:\b|(?<=^|\s))(?:$PATTERN)(?:\b|(?=\s|$))
   ​​ (?:^|\b|\s)\K(?:$PATTERN)(?:\b|(?=\s|$))
and maybe *COMMIT or *PRUNE  variants ??
(i haven't used those since Prolog days ... )


[ I consider using -w with a pattern that can and likely will  _match_ a whitespace char at either end erroneous, it violates concept of wordness that is asserted when asking -w, and is thus a PEBCAK user-error, behavio[u]?r undefined, and so I do not care what the above does to 'Sue! '  or '\w+[!].+'  matching "Sue! Fred!  Team!" ... but i predict it will usually work anyway, aside from nested edgecases. ]



--

Andy Lester

unread,
Mar 8, 2017, 4:21:39 PM3/8/17
to Bill Ricker, ack development

> On Mar 8, 2017, at 2:44 PM, Bill Ricker <bill....@gmail.com> wrote:
>
> On Wed, Mar 8, 2017 at 2:30 PM, Andy Lester <an...@petdance.com> wrote:
> Anyone see downsides? Is there something we can use better than \b ?
>
> Yes.

Just hashed out a strategy. In the next ack2, we'll be using "\b(?:PATTERN)\b", which won't be perfect but will be better than the surprising behavior we have right now.

Entirely coincidental to this, coworker just pinged me surprised that "ack -w 410[2]" matches "4102999" but "ack -w 4102" does not. So we have a non-zero user base that is being bitten by this.

As Bill points out, this is an imperfect match. So, in ack3 we will be doing -w as (?:^|\b|\s)\K(?:PATTERN)(?=\s|\b|$). Unfortunately, that \K requires Perl 5.10, so is going to have to wait until ack3 when we change the required Perl version.

Ed Avis: You've been silent during all this. How does this fit with what you're working with?

Andy Lester

unread,
Mar 9, 2017, 12:55:59 PM3/9/17
to Bill Ricker, ack development

On Mar 8, 2017, at 3:21 PM, Andy Lester <an...@petdance.com> wrote:

Just hashed out a strategy.  In the next ack2, we'll be using  "\b(?:PATTERN)\b", which won't be perfect but will be better than the surprising behavior we have right now.

Entirely coincidental to this, coworker just pinged me surprised that "ack -w 410[2]" matches "4102999" but "ack -w 4102" does not.  So we have a non-zero user base that is being bitten by this.

As Bill points out, this is an imperfect match.  So, in ack3 we will be doing -w as (?:^|\b|\s)\K(?:PATTERN)(?=\s|\b|$).  Unfortunately, that \K requires Perl 5.10, so is going to have to wait until ack3 when we change the required Perl version.

I’ve pondered this more, and now I’m wondering that we should NOT fix the -w in ack 2.  It’s possible that this fix will result in fewer hits for people.  They are hits that maybe shouldn’t have come out before, but that’s still the potential to have people miss results they would have seen with the old behavior.

Thoughts?

Bill Ricker

unread,
Mar 9, 2017, 10:10:21 PM3/9/17
to Andy Lester, ack development

On Thu, Mar 9, 2017 at 12:55 PM, Andy Lester <an...@petdance.com> wrote:

I’ve pondered this more, and now I’m wondering that we should NOT fix the -w in ack 2.  It’s possible that this fix will result in fewer hits for people.  They are hits that maybe shouldn’t have come out before, but that’s still the potential to have people miss results they would have seen with the old behavior.

​Right now if works for most simple cases and fails for ​leading or trailing meta chars, and the "fix" proposed would break it for non-word non-meta leading/trailing pattern in order to fix the meta, so yes, it could hurt some existing (allegedly) valid uses. 

​I would be happy to have just a warning "unsupported fixed in ack3" for leading or trailing chars that MIGHT be meta in pattern, for quick release of last stable ack2 . ​

Andy Lester

unread,
Mar 9, 2017, 10:11:45 PM3/9/17
to Bill Ricker, ack development

On Mar 9, 2017, at 9:10 PM, Bill Ricker <bill....@gmail.com> wrote:

​I would be happy to have just a warning "unsupported fixed in ack3" for leading or trailing chars that MIGHT be meta in pattern, for quick release of last stable ack2 . ​

I could see that.  What would be the mechanism to turn off the warning for people who really need the punctuation?

Bill Ricker

unread,
Mar 9, 2017, 11:58:57 PM3/9/17
to Andy Lester, ack development

On Thu, Mar 9, 2017 at 10:11 PM, Andy Lester <an...@petdance.com> wrote:
​I would be happy to have just a warning "unsupported fixed in ack3" for leading or trailing chars that MIGHT be meta in pattern, for quick release of last stable ack2 . ​

I could see that.  What would be the mechanism to turn off the warning for people who really need the punctuation?

​if they get the warning (or error), it means they're getting  wrong answers ​and need to drop the -w and just say what they really mean, which may well  be \b(?:$pattern)\b 
which suggestion should be in the message and man/dot.

(there MAY a small falsepositive for the message if pattern ends with a backslash-quoted meta char; but they can and and likely should follow advice anyway. if there was no possibility of falsepositive, i'd insist it's Error level fail.)

Edward Avis

unread,
Mar 10, 2017, 5:37:25 AM3/10/17
to ack development, bill....@gmail.com
>Just hashed out a strategy.  In the next ack2, we'll be using  "\b(?:PATTERN)\b", which won't be perfect but will be better than the surprising behavior we have right now.

That is certainly an improvement.  Have you compared it to the approach taken in my current merge request, which does

    (?:\b|(?!\w))(?:PATTERN)(?:\b|(?<!\w))

This will match more cases than the simpler regexp.  In particular, it means that the pattern

    foo[(][)]

will match the string 'foo()' -- while still not matching 'tofoo()', etc.

I think that my version may be slightly better adapted to searching through common programming languages, though if you think that the greater simplicity of just wrapping with \b is preferable I will be happy to go along with that.


>Entirely coincidental to this, coworker just pinged me surprised that "ack -w 410[2]" matches "4102999" but "ack -w 4102" does not.
>So we have a non-zero user base that is being bitten by this.

Well indeed, and that has been the case ever since the bug in -w handling was reported back in 2014 :-(


>As Bill points out, this is an imperfect match.  So, in ack3 we will be doing -w as (?:^|\b|\s)\K(?:PATTERN)(?=\s|\b|$).

Sounds interesting.  If the text has already been broken into lines, I would tend to prefer \z rather than the $ anchor, since $ allows for a newline character first.  (Also, the meaning of $ depends on the /m flag and has been unclear over much of Perl's history, being fixed only recently  -- check the perl5-porters archives for the full gory details.)  For symmetry I would also use \A which is unambiguously the start the string and does not depend on flags.

I don't yet have an idea of how this proposed regexp differs from the two alternatives above, or how it differs from what grep -w does.  The exact behaviour in corner cases may not matter that much, as long as it doesn't get set in stone so that bugs cannot be fixed until ack 4.


>I’ve pondered this more, and now I’m wondering that we should NOT fix the -w in ack 2.  It’s possible that this fix will result in fewer hits for people.

For my use case, getting fewer hits is the desired behaviour and it's what drove me to send the patch.  I can see what you're saying but in my view "false positive" and "false negative" matches are equally important to fix.  Without knowing exactly what the end use of ack is, it's impossible to say which of the two is the bigger problem.  I will note that ack is an interactive tool, and that if you want rigorously POSIX-specified semantics for regular expressions and flags or easily parsable output you would surely use grep instead.

Bill R. wondered whether


>the "fix" proposed would break it for non-word non-meta leading/trailing pattern

I agree that any new behaviour should not break existing patterns such as #foo where the start or end character is a non-word character (but not necessarily a regexp special character).

This is one advantage of my proposed scheme described above over the simple \b(?:PATTERN)\b.  Given the string '#foo' and the pattern '#foo', mine finds a match but the simpler approach does not.  Again, I suggest that this is better suited in practice to the task of searching through source code.

Edward Avis

unread,
Mar 10, 2017, 6:02:29 AM3/10/17
to ack development, bill....@gmail.com
FWIW, a handful of test cases checking the difference between the simple \b(...)\b approach and my approach:
#!/usr/bin/perl
use 5.022;
my @strs = ('foo', 'football', '123', '12345', 'foo()', 'football()', 'foo()a', 'tofoo()', '#foo', 'boo#foo');
my @patterns = ('foo', '123', 'fo.', '12.', 'foo..', 'foo[(][)]', '#foo', 'boo#');
foreach my $pattern (@patterns) {
    foreach my $str (@strs) {
        my $simple = $str =~ /\b(?:$pattern)\b/;
        my $complex = $str =~ /(?:\b|(?!\w))(?:$pattern)(?:\b|(?<!\w))/;
        $_ = $_ ? 'yes' : 'no' foreach $simple, $complex;
        if ($simple ne $complex) {
            say "difference for $str and $pattern: simple $simple, complex $complex";
        }
    }
}


Andy Lester

unread,
Mar 10, 2017, 12:48:17 PM3/10/17
to Edward Avis, ack development, Bill Ricker

> On Mar 10, 2017, at 4:37 AM, Edward Avis <e...@membled.com> wrote:
>
> >Entirely coincidental to this, coworker just pinged me surprised that "ack -w 410[2]" matches "4102999" but "ack -w 4102" does not.
> >So we have a non-zero user base that is being bitten by this.
>
> Well indeed, and that has been the case ever since the bug in -w handling was reported back in 2014 :-(

I know it's been going around for a while. My apologies. It wasn't until a week or two ago that I grokked the significance of what it meant in use cases.

Bill Ricker

unread,
Mar 10, 2017, 5:11:10 PM3/10/17
to Andy Lester, Edward Avis, ack...@googlegroups.com
Re 
If the text has already been broken into lines, 

It is in ack2 and i expect in ack3 

> I would tend to prefer \z rather than the $ anchor, since $ allows for a newline character first.  (Also, the meaning of $ depends on the /m flag 

Agreed. My footnotes in my rants upthread made same point. And we agree with Damian's PGP here. (Except he'd have us use /xism always.)

>  For symmetry I would also use \A which is unambiguously the start the string and does not depend on flags.

Yes. 

(One of _my_ pet suggestions for ack3 is supporting paragrep mode (I have files I have to use perl -000 instead of ack) in which case we may need /sm  for those 

Reply all
Reply to author
Forward
0 new messages