>Just hashed out a strategy. In the next ack2, we'll be using "\b(?:PATTERN)\b", which won't be perfect but will be better than the surprising behavior we have right now.
That is certainly an improvement. Have you compared it to the approach taken in my current merge request, which does
(?:\b|(?!\w))(?:PATTERN)(?:\b|(?<!\w))
This will match more cases than the simpler regexp. In particular, it means that the pattern
foo[(][)]
will match the string 'foo()' -- while still not matching 'tofoo()', etc.
I think that my version may be slightly better adapted to searching through common programming languages, though if you think that the greater simplicity of just wrapping with \b is preferable I will be happy to go along with that.
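To make the difference concrete, here is a rough sketch in Python rather than Perl. I am assuming that Python's re module treats \b and the lookarounds the same way as Perl for these plain-ASCII cases. The helper names wrap_simple and wrap_lenient are made up for illustration and are not part of ack.

```python
import re

def wrap_simple(pattern):
    # The plain \b wrapper planned for the next ack2 (hypothetical helper name).
    return r"\b(?:" + pattern + r")\b"

def wrap_lenient(pattern):
    # The wrapper from the merge request, copied from the regexp quoted above.
    return r"(?:\b|(?!\w))(?:" + pattern + r")(?:\b|(?<!\w))"

pattern = r"foo[(][)]"
for text in ("foo()", "tofoo()"):
    simple = bool(re.search(wrap_simple(pattern), text))
    lenient = bool(re.search(wrap_lenient(pattern), text))
    print(f"{text!r}: simple={simple} lenient={lenient}")
```

The simple wrapper misses 'foo()' because there is no word boundary after the closing parenthesis at the end of the string, whereas the lenient wrapper accepts it. Both reject 'tofoo()'.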
>Entirely coincidental to this, coworker just pinged me surprised that "ack -w 410[2]" matches "4102999" but "ack -w 4102" does not.
>So we have a non-zero user base that is being bitten by this.
Well indeed, and that has been the case ever since the bug in -w handling was reported back in 2014 :-(
>As Bill points out, this is an imperfect match. So, in ack3 we will be doing -w as (?:^|\b|\s)\K(?:PATTERN)(?=\s|\b|$).
Sounds interesting. If the text has already been broken into lines, I would tend to prefer \z rather than the $ anchor, since $ also matches just before a trailing newline. (Also, the meaning of $ depends on the /m flag and has been unclear over much of Perl's history, being fixed only recently -- check the perl5-porters archives for the full gory details.) For symmetry I would also use \A, which is unambiguously the start of the string and does not depend on flags.
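A small sketch of the anchor difference, again in Python as a stand-in for Perl. Note the naming mismatch: Python's \Z matches only at the very end of the string, so it plays the role of Perl's \z here. The sample strings are made up.

```python
import re

line = "foo\n"  # a line that still carries its trailing newline

# Without the multiline flag, `$` also matches just before a final newline.
dollar = re.search(r"foo$", line)
# Python's `\Z` (Perl's `\z`) matches only at the very end of the string.
strict = re.search(r"foo\Z", line)

# With the multiline flag, `^` matches after any newline; `\A` never does.
text = "x\nfoo"
caret = re.search(r"^foo", text, re.MULTILINE)
start = re.search(r"\Afoo", text, re.MULTILINE)

print(bool(dollar), bool(strict), bool(caret), bool(start))
```

So the flag-independent anchors only make a visible difference when the line still has its newline attached or when multiline matching is switched on.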
I don't yet have an idea of how this proposed regexp differs from the two alternatives above, or how it differs from what grep -w does. The exact behaviour in corner cases may not matter that much, as long as it doesn't get set in stone so that bugs cannot be fixed until ack 4.
>I’ve pondered this more, and now I’m wondering that we should NOT fix the -w in ack 2. It’s possible that this fix will result in fewer hits for people.
For my use case, getting fewer hits is the desired behaviour and it's what drove me to send the patch. I can see what you're saying but in my view "false positive" and "false negative" matches are equally important to fix. Without knowing exactly what the end use of ack is, it's impossible to say which of the two is the bigger problem. I will note that ack is an interactive tool, and that if you want rigorously POSIX-specified semantics for regular expressions and flags or easily parsable output you would surely use grep instead.
Bill R. wondered whether
>the "fix" proposed would break it for non-word non-meta leading/trailing pattern
I agree that any new behaviour should not break existing patterns such as #foo where the start or end character is a non-word character (but not necessarily a regexp special character).
This is one advantage of my proposed scheme described above over the simple \b(?:PATTERN)\b. Given the string '#foo' and the pattern '#foo', mine finds a match but the simpler approach does not. Again, I suggest that this is better suited in practice to the task of searching through source code.
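The '#foo' case can be checked the same way. This is a sketch in Python, assuming its re module agrees with Perl on \b and lookarounds for ASCII input. The variable names and the extra sample strings are mine.

```python
import re

pattern = "#foo"  # starts with a non-word character that is not a regexp special

# Plain \b wrapper versus the wrapper from the merge request.
simple = r"\b(?:" + pattern + r")\b"
lenient = r"(?:\b|(?!\w))(?:" + pattern + r")(?:\b|(?<!\w))"

for text in ("#foo", "x = 1 #foo", "#foobar"):
    s = bool(re.search(simple, text))
    l = bool(re.search(lenient, text))
    print(f"{text!r}: simple={s} lenient={l}")
```

With the plain \b wrapper there is no word boundary in front of '#' at the start of the string or after a space, so '#foo' is never found in the first two strings. The lenient wrapper finds it there and still rejects '#foobar'.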