ack 3

Andy Lester

Feb 23, 2017, 11:34:12 PM
to ack...@googlegroups.com
I'm seriously considering starting ack 3.

I think that the infrastructure that we have is pretty brittle, especially with all the hoohah we put in to allow plugins. Well, nobody uses plugins, and nobody wants to write them, so they need to get yanked.

We also have some pretty heinous bugs like -c not counting lines correctly, and highlighting getting screwed up on capture groups.

Also, --ignore is a mess because of how it's defined. I'm not even sure myself how it works. Is it absolute? Is it relative? To what? We need to hammer that out.

I'm also thinking that we can safely require Perl 5.10. This will give us state variables, which I suspect may be very useful.

I went through the issues and tagged the ones that I think are most pressing, and will be new features in ack 3. https://github.com/petdance/ack2/issues?q=is%3Aissue+is%3Aopen+label%3Aack3

Thoughts?

--
Andy Lester => www.petdance.com

Andy Lester

Feb 24, 2017, 12:40:35 PM
to ack...@googlegroups.com

> On Feb 23, 2017, at 10:34 PM, Andy Lester <an...@petdance.com> wrote:
>
> I'm seriously considering starting ack 3.

I'm thinking that ack3 should be a new repo, since the current one is called "ack2", but if we do that then any references to existing issues, such as in t/issue*.t, will be broken.

Bill Ricker

Feb 24, 2017, 1:39:41 PM
to Andy Lester, ack...@googlegroups.com

On Fri, Feb 24, 2017 at 12:40 PM, Andy Lester <an...@petdance.com> wrote:
> I'm seriously considering starting ack 3.

> I'm thinking that ack3 should be a new repo, since the current one is called "ack2", but if we do that then any references to existing issues, such as in t/issue*.t, will be broken.

OTOH could make it ack2.5 and keep the ack2 repo, with a branch for ack2.5 work?
Or, make it 2.9 for the working branch with intent to release as 3.0 when out of beta, and cut over to new repo *then*?

(long-lived branches and cross-merging deltas work well in git, unlike some earlier SCCS packages)


--

Edward Avis

Feb 24, 2017, 4:58:06 PM
to ack development
I think it would make sense to do some cleanups (such as removing plugins, and starting to use a more modern Perl version) before starting any big new version.  I don't think it matters whether the next version be called ack 2.17 or 2.18 or 3.0 or whatever, but I instinctively prefer an incremental approach.

Ack isn't a library where you have to carefully preserve multiple stable branches.  It's an interactive tool and I think it's fine for behaviour to change, if the new behaviour is an improvement.  In scripting one would use grep instead, which does have much more rigorously defined behaviour.  So personally I would not let any version number arguments stand in the way of making changes.

Since you mention heinous bugs, I will take the opportunity to promote my own pet bug, "ack -w '(foo|bar)'" which incorrectly matches 'football', 'barfly' and so on even though they are not whole word matches.

Andy Lester

Feb 24, 2017, 4:59:07 PM
to Edward Avis, ack development

> On Feb 24, 2017, at 11:37 AM, Edward Avis <e...@membled.com> wrote:
>
> Since you mention heinous bugs, I will take the opportunity to promote my own pet bug, "ack -w '(foo|bar)'" which incorrectly matches 'football', 'barfly' and so on even though they are not whole word matches.

Please elaborate on what you see as the bug and what you propose to fix it.

Bill Ricker

Feb 24, 2017, 5:13:24 PM
to Andy Lester, Edward Avis, ack development
On Fri, Feb 24, 2017 at 4:58 PM, Andy Lester <an...@petdance.com> wrote:
> Please elaborate on what you see as the bug and what you propose to fix it.

1. Elaborate

Semantically
ack -w '(foo|bar)'
should be the same as any of
ack '\b(foo|bar)\b'
ack '(\bfoo\b|\bbar\b)'
ack '(?x) \b (?x: foo | bar ) \b'

With this input file
$ cat barfly.txt
yes foo
yes bar
no schmfoo
no schmofool
no barfly
no fubar
no barometric
no subarometric
$
we want the same result as `ack yes` ( which is what we get with the
\b boundaries )
$ ack '(?x) \b (?x: foo | bar ) \b' barfly.txt
yes foo
yes bar
$

but with ack2 i get

$ ack -w '(foo|bar)' barfly.txt
yes foo
yes bar
no schmfoo
no schmofool
no barfly
no fubar
no barometric
no subarometric
$

That's an Oopsie.

2. Propose

first, add barfly.txt above to our test suite ...

if there's an obviously simple bugfix, do so in ack2;
else burn down existing -w code and code it right in ack3.
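
The barfly.txt case above can be written down as a small executable check. What follows is only a sketch in Python, not ack's test suite: the helper name `matching_lines` is hypothetical, and Python's `re` stands in for Perl's regex engine, which treats `\b` the same way for these inputs.

```python
import re

# Contents of barfly.txt as quoted in the message above.
BARFLY = """\
yes foo
yes bar
no schmfoo
no schmofool
no barfly
no fubar
no barometric
no subarometric
"""


def matching_lines(pattern, text):
    """Return the lines of text in which pattern matches somewhere."""
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]


# Explicit \b anchors: only the two whole-word lines are expected.
anchored = matching_lines(r"\b(foo|bar)\b", BARFLY)

# No anchors at all: every line contains "foo" or "bar" somewhere,
# which is the same set of lines the ack2 -w run above printed.
unanchored = matching_lines(r"(foo|bar)", BARFLY)

print(anchored)
print(len(unanchored))
```

A negative test in this style ("these lines must not be found") is the kind of constraint a -w test file would need alongside the positive cases.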

Bill Ricker

Feb 25, 2017, 2:17:49 PM
to Andy Lester, Edward Avis, ack development
Aha, (2) Propose ... i find this is already reported in
https://github.com/petdance/ack2/issues/445 (and previously as #14 also) and
https://github.com/petdance/ack2/pull/558 has a pull request from EPA=Ed Avis

so the above email comment was a reminder that Ed has submitted a
much-discussed pull request to fix this, but it has a merge conflict.

Packy Anderson

Feb 25, 2017, 3:02:17 PM
to Andy Lester, ack...@googlegroups.com
On Thu, Feb 23, 2017 at 11:34 PM, Andy Lester <an...@petdance.com> wrote:
> I think that the infrastructure that we have is pretty brittle, especially with all the hoohah we put in to allow plugins.  Well, nobody uses plugins, and nobody wants to write them, so they need to get yanked.

What were plugins supposed to be for?  I forget.


--
Packy Anderson

Email:  PackyA...@gmail.com
GVoice: (646) 833-8832

Edward Avis

Feb 27, 2017, 11:01:10 AM
to ack development, an...@petdance.com, e...@membled.com
Thanks Bill R. for elaborating about the bug with -w.  An even simpler test case is to say that

    ack -w 'fo{2}'

should be semantically equivalent to

    ack -w 'foo'

since the two regexps match the same strings.  Currently the first will match 'football' but the second does not.  This doesn't match (my personal) expectation of a "whole words match" and it is also inconsistent with grep.

There are various apparent fixes which turn out to be wrong -- indeed, the current behaviour dates back to an earlier incorrect attempt to fix the -w flag, which in its very first incarnation just textually wrapped the regexp in \b anchors.  After one false start I suggested in bug 445 a method which I believe handles all cases correctly:

        $str = "(?:\\b|(?!\\w))$str";
        $str = "$str(?:\\b|(?<!\\w))";
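
To see what this wrapping does without parsing the pattern, here is a rough Python transcription. It is a sketch only: the function names are hypothetical, and the extra `(?:...)` group around the user's pattern is an assumption added here so that a bare alternation is wrapped as a unit. The lookaround syntax used is the same in Perl and in Python's `re`.

```python
import re


def wrap_whole_word(pattern):
    """Wrap pattern so a match can neither start nor end inside a word."""
    # Start: either a word boundary, or the next character is not a word char.
    start = r"(?:\b|(?!\w))"
    # End: either a word boundary, or the previous character is not a word char.
    end = r"(?:\b|(?<!\w))"
    # The inner group is an assumption of this sketch, not part of the
    # two Perl lines quoted above.
    return start + "(?:" + pattern + ")" + end


def found(pattern, line):
    """True if the wrapped pattern matches somewhere in line."""
    return re.search(wrap_whole_word(pattern), line) is not None


print(found(r"fo{2}", "football"))
print(found(r"fo{2}", "a foo here"))
print(found(r"(foo|bar)", "no barfly"))
print(found(r":::", "a:::a"))
```

Because the assertions look only at the characters next to the match, `fo{2}` and `foo` behave identically, and a pattern made of non-word characters such as `:::` is still allowed to match between two words.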

As well as the proposed fix I submitted a merge request to document and test the current behaviour of -w, which you can find in trunk.
However I hope you agree with me that this documentation of the odd "words mode" is describing a peculiar behaviour that few would want to use,
and the test case added is testing for the presence of a bug rather than serving to describe a useful and correct set of semantics.
Still, it does ensure that when the bug is fixed the documentation and test cases will be updated with it.

This recent documentation change must be the reason why my merge request no longer merges -- I will see if I can rebase it.

Andy Lester

Feb 27, 2017, 11:06:23 AM
to Edward Avis, ack development

On Feb 27, 2017, at 10:01 AM, Edward Avis <e...@membled.com> wrote:

> This recent documentation change must be the reason why my merge request no longer merges -- I will see if I can rebase it.

Thank you, but don’t bother.  If there are changes to be made to how -w works, it will be in 3.0, which will be a new project.

Edward Avis

Feb 27, 2017, 11:20:00 AM
to ack development, e...@membled.com
Would you accept a patch to add a warning when -w is used with an exotic regular expression?

    % ack -w 'fo{2}'
    warning: although -w is given, matches found will not necessarily be whole words, since the regexp does not end in a word character

At the moment it is a bit of a trap for the unwary if you are expecting semantics similar to grep.
A warning would also help the transition to ack 3.0, if indeed you decide to change the behaviour of -w in that version.

Andy Lester

Feb 27, 2017, 11:23:10 AM
to Edward Avis, ack development

On Feb 27, 2017, at 10:20 AM, Edward Avis <e...@membled.com> wrote:

>     % ack -w 'fo{2}'
>     warning: although -w is given, matches found will not necessarily be whole words, since the regexp does not end in a word character

Interesting.  Thoughts from others?

Edward Avis

Feb 27, 2017, 11:23:21 AM
to ack development, e...@membled.com
For the record, I have rebased my fixes to -w and sent them as <https://github.com/petdance/ack2/pull/629>
Although you have indicated you don't want to change the behaviour in ack 2.x, I hope the pull request will be useful for those who want to apply a local patch to fix -w.

Bill Ricker

Feb 27, 2017, 5:50:02 PM
to Andy Lester, Edward Avis, ack development

On Mon, Feb 27, 2017 at 11:23 AM, Andy Lester <an...@petdance.com> wrote:
> On Feb 27, 2017, at 10:20 AM, Edward Avis <e...@membled.com> wrote:
>
>>     % ack -w 'fo{2}'
>>     warning: although -w is given, matches found will not necessarily be whole words, since the regexp does not end in a word character
>
> Interesting.  Thoughts from others?

Well, in a final or temporizing Ack2 release (whether 2_16 or 2_20) I would _prefer_ to apply a real patch for this, if it appears safe and complete, as this is an ugly bug, whether it's EPA's pull request or a different solution to the problem.
   If.
   (But I would like a bit more in added testcases, which i might choose to provide if we go this route, since I don't think the proffered PR added enough tests to fully constrain the desired behavior. In particular I like negative tests, don't find this!  It's still unclear to me what a non-meta non-word char at beginning or end _should_ do with -w; they might be erroneous. Obviously Metachars shouldn't change the sense of the boundaries from what the first and last non-meta-chars match, since they're not matched but modifying something else. But how do we know the trailing } is meta and not a \}  ?  A disjunctive pattern or pattern with optional char first or last doesn't even have definite single first and last ... Parsing a regexp is NOT something we want to do if we can avoid it ! It's turtles or at least gophers all the way down.  )

Failing that, yes, a warning that -w is not warrantied in corner cases where pattern fails to start and end with word chars
   warn $blahblah unless $pattern =~ m{ \A \w.*\w \z }x ;
is better than leaving ack2 -w silently broken forever, even if it over-warns.
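
As a rough illustration of how that one-line check behaves, here is the same test in Python. This is a sketch under assumptions: the function name `needs_w_warning` is hypothetical, and Python's `\Z` is used where the Perl line has `\z` (both mean "very end of string").

```python
import re

# Same shape as the Perl check above: the pattern text itself must start
# and end with a word character.
LOOKS_LIKE_WORD = re.compile(r"\A\w.*\w\Z")


def needs_w_warning(pattern):
    """True if the pattern text does not both start and end with a word char."""
    return LOOKS_LIKE_WORD.search(pattern) is None


for candidate in ["foo", "foo-bar", "fo{2}", "(foo|bar)", "a"]:
    print(candidate, needs_w_warning(candidate))
```

The check inspects only the pattern's text, so it does over-warn: `(foo|bar)` is flagged because of its parentheses, and a one-character pattern such as `a` is flagged because the expression requires at least two characters.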

If the MAN section on -w can mention the warning, explain the likely misbehavior more fully, apologize that it will be fixed for real in Ack3 when we adjust buffering, and hint at how to decorate a pattern to safely work with broken -w (if even possible), that would be good too.

It seems as if -w is only guaranteed totally correct if the pattern is a literal word with no metachars at all ... possibly influenced by -i .

Edward Avis

Feb 28, 2017, 4:47:37 AM
to ack development, an...@petdance.com, e...@membled.com
Bill R., you are right that the test coverage could be more complete.  I will see if I can expand it.


>Parsing a regexp is NOT something we want to do if we can avoid it !

I quite agree, which is why my fix for -w does not parse the regexp at all.

Bill Ricker

Mar 3, 2017, 1:18:18 PM
to Edward Avis, ack development, Andy Lester

On Tue, Feb 28, 2017 at 4:47 AM, Edward Avis <e...@membled.com> wrote:
> Bill R., you are right that the test coverage could be more complete.  I will see if I can expand it.

Sounds like Andy is committed to doing -w right in Ack3 ... he's brainstormed what's needed to support good tests for both -w and highlighting $&, and is interested in a full specification of what 'right' is for -w.  He sees ack -w '(set|get)_user_(name|id)' as a reasonable thing to do.(*)  Since the first and last literals matched are word-like no matter what path is taken, this is unambiguous. IDK offhand what -w should mean if the pattern could match either a word or a non-word char at one end or the other ... something we need to brainstorm for Ack3 design.

(*) (One can ask if those should be non-capturing by default. But on other thread, Andy said ack3 will highlight $& so it sort of doesn't matter.)

So for stabilizing Ack2's open bugs w.r.t. -w, we should focus on a PR for setting user expectations --
* when to warn or error for dubious -w pattern
     is it too easy to miss a warning when scrolling?
     are there any cases of leading or trailing nonword  $pattern =~ m{ \A\W | \W\z }x  that DO work reliably?
* wording of message
* extra documentation in Man on how to get what you really want, e.g. use '^(foo|bar|bat)\b' instead of -w '^(?:foo|bar|bat)'

(We should probably document the useful uses of extended REs in the man-page too.)

Edward Avis

Mar 6, 2017, 4:45:41 AM
to ack development, e...@membled.com, an...@petdance.com
My view is that, with -w, the start of the match cannot be in the middle of a word.  That is, either the first character of the match is the first character of a word, or the first character of the match is a non-word character (or the match is empty).  Similarly, the end of the match cannot be in the middle of a word.

However, grep -w takes a stricter approach.  It requires a non-word character (or start of line) just before the match, and similarly a non-word character (or end of line) after the match.  The difference is with a file containing the line

    a:::a

and a search like

    grep -w ::: test_file

Grep will not find a match, while my proposed implementation of -w would find one.

However, I am not particularly attached to the semantics I picked.  Ack's general principle is "what would grep do" and I think it may be sensible to follow it in this case.  All I will say is that the current semantics of -w in ack are really just an accident (from an incomplete bug fix to an earlier attempt) and don't really deserve the label "stable".  A stable release in my opinion would be the one that gets the supported features working as intended, rather than preserving mosquitoes in amber.

But anyway - if we can agree to "do what grep does" for -w as for other flags, that should cut through a lot of confusion and also provide a ready-made test suite (copy the one from GNU grep, or simply run grep -w on lots of randomly generated strings).

If Andy L. agrees that following grep is the way forward, then I will update my merge request to do that.  It can then be merged in once ack 3 development opens.

Andy Lester

Mar 6, 2017, 9:56:23 AM
to Edward Avis, ack development

On Mar 6, 2017, at 3:45 AM, Edward Avis <e...@membled.com> wrote:

> If Andy L. agrees that following grep is the way forward, then I will update my merge request to do that.  It can then be merged in once ack 3 development opens.



We are nowhere near the point of being ready to write code.  We need to hash out specifics of how -w works.

I am not going to change the behavior in ack 2, so please don’t bother working on code for -w in ack 2.

I appreciate the enthusiasm, and we just need to wait a bit on it.

Edward Avis

Mar 6, 2017, 10:01:07 AM
to ack development, e...@membled.com
I just wondered if you agree with the general principle of "do what grep does" being applied to the -w flag too.  If so, then it takes care of a great deal of the specifics of how -w works.

If you don't feel that doing the same thing as grep is the right behaviour for -w, or you want to consider other possible behaviours even if they would break grep-compatibility, then I will hold back.

Andy Lester

Mar 6, 2017, 10:06:04 AM
to Edward Avis, ack development
On Mar 6, 2017, at 9:01 AM, Edward Avis <e...@membled.com> wrote:

> I just wondered if you agree with the general principle of "do what grep does" being applied to the -w flag too.  If so, then it takes care of a great deal of the specifics of how -w works.

In general, yes, we should do what grep does.

My main point in that reply was to save you from going and writing code that probably would not get used.

Edward Avis

Mar 6, 2017, 10:48:43 AM
to ack development, e...@membled.com
Thanks.  At this point I am concerned with having a working -w implementation for my own use and that of the users at my site.  I will probably move to "do what grep does" in order to align with the planned change in upstream.

Andy Lester

Mar 6, 2017, 10:49:51 AM
to Edward Avis, ack development

On Mar 6, 2017, at 9:48 AM, Edward Avis <e...@membled.com> wrote:

> Thanks.  At this point I am concerned with having a working -w implementation for my own use and that of the users at my site.  I will probably move to "do what grep does" in order to align with the planned change in upstream.



Right now my big concern is test cases.  Not necessarily to the .t level, but examples of “this matches, this doesn’t”.  If you have any, that would be a great kickoff.

Bill Ricker

Mar 6, 2017, 10:54:55 AM
to Andy Lester, Edward Avis, ack development

On Mon, Mar 6, 2017 at 10:06 AM, Andy Lester <an...@petdance.com> wrote:
> In general, yes, we should do what grep does.

Beyond "in general", if grep is obviously egregiously wrong (unlikely, 40+ years on) or incompletely specified, we can embroider around the edges.

'grep -w' seems to assume that the pattern is a word not an antiword, so that 'grep -w :::' would be erroneous usage and undefined? If so, this is a case where we can legitimately extend the meaning.

I'm not sure if -w ':::' should match '-:::-' or not; Ed seems to suggest yes, I'm leaning towards no.  The pattern doesn't start in the middle of a true word, but it starts in the middle of an antiword.  If -w $antiword means anything, I would think it should mean a clump of printable characters delimited by spacing or word chars. So instead of just wrapping with \b we need

$pattern =
   qr{ (?: \A | \b | \s ) # \s|\b is redundant with \w target but needed for \W pattern
       $pattern
       (?: \b | \s | \z )
      }x;


except the (?: ...)  should probably be lookahead and lookbehind so as not to contribute to $&.
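
The difference between the consuming wrapper and a lookaround version shows up in what ends up in the matched text ($& in Perl, `group(0)` in Python). The following is a sketch only, with hypothetical function names, using Python's `re`; `\Z` plays the role of Perl's `\z`, and the extra group around the pattern is an assumption of this sketch.

```python
import re


def consuming(pattern):
    """The wrapper as posted: delimiters are consumed and become part of the match."""
    return re.compile(r"(?:\A|\b|\s)(?:" + pattern + r")(?:\b|\s|\Z)")


def lookaround(pattern):
    """Same delimiters, but as zero-width assertions that add nothing to the match."""
    return re.compile(r"(?:\A|\b|(?<=\s))(?:" + pattern + r")(?:\b|(?=\s)|\Z)")


line = "x ::: y"
with_spaces = consuming(":::").search(line).group(0)
without_spaces = lookaround(":::").search(line).group(0)

print(repr(with_spaces))
print(repr(without_spaces))

# An antiword embedded in other non-word characters is not matched.
print(lookaround(":::").search("-:::-"))
```

Only the `\s` alternatives need converting; `\A`, `\b` and `\Z` are already zero-width.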

Edward Avis

Mar 6, 2017, 12:22:36 PM
to ack development, an...@petdance.com, e...@membled.com
Bill R. wrote:

>'grep -w' seems to assume that the pattern is a word not an antiword

The GNU grep documentation specifies that

>the matching substring must either be at the beginning of the line, or preceded by a non-word
>constituent character.  Similarly, it must be either at the end of the line or followed by a non-word constituent character.

So I think it's well defined, whether the pattern be a word, an antiword, or a mixture.

While I agree that we could start to define different behaviours for 'ack -w :::' which we may consider make more sense than grep, I think the bar has to be set quite high for that.  This is partly for philosophical reasons (ack tries to follow what grep does), but mostly for practical ones.  If you decide to follow grep then any argument about semantics can be settled instantly.  If you decide to do something which is almost the same but a tiny bit better, you end up with long mailing list threads going over what ought to happen.  That is why I suggest copying grep, not because I have a strong opinion about what -:::- should match, but to simplify life and allow us to get on with implementation.

Andy Lester

Mar 6, 2017, 12:24:37 PM
to Edward Avis, ack development

On Mar 6, 2017, at 11:22 AM, Edward Avis <e...@membled.com> wrote:

> While I agree that we could start to define different behaviours for 'ack -w :::' which we may consider make more sense than grep, I think the bar has to be set quite high for that.  This is partly for philosophical reasons (ack tries to follow what grep does), but mostly for practical ones.  If you decide to follow grep then any argument about semantics can be settled instantly.  If you decide to do something which is almost the same but a tiny bit better, you end up with long mailing list threads going over what ought to happen.  That is why I suggest copying grep, not because I have a strong opinion about what -:::- should match, but to simplify life and allow us to get on with implementation.

So what behavior exactly are you proposing?

Edward Avis

Mar 6, 2017, 12:28:36 PM
to ack development, e...@membled.com
On Monday, 6 March 2017 15:49:51 UTC, Andy Lester wrote:

>Right now my big concern is test cases

Do you mean test cases for -w?  There are some in GNU grep's test suite (though not as many as I had expected).  There are also some test cases in my -w fix, based on the standard ack test files.  If you agree that "what would grep do" is the way forward, I will produce more tests and cross-check them against GNU grep.  If not, I may stand back and let others opine on how -w should behave, since in fact I don't have strong opinions on the matter except to note that the current -w code is not very useful.

Edward Avis

Mar 6, 2017, 12:30:48 PM
to ack development, e...@membled.com
> So what behavior exactly are you proposing?

I am proposing to make the -w flag follow GNU grep, which documents its behaviour as

>Select only those lines containing matches that form whole words.  The
>test is that the matching substring must either be at the beginning of
>the line, or preceded by a non-word constituent character.  Similarly,
>it must be either at the end of the line or followed by a non-word
>constituent character.  Word-constituent characters are letters,
>digits, and the underscore.

This is subtly different from the code I wrote, but I am quite willing to change the code and the test cases in the interest of following the precedent set by grep.
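
The wording quoted from the grep documentation translates fairly directly into a pair of negative lookarounds. The sketch below is an approximation of that documented rule in Python, not grep's actual implementation; the function name is hypothetical, and `re.ASCII` is an assumption made here so that `\w` means exactly letters, digits and underscore.

```python
import re


def grep_style_w(pattern):
    """Require a non-word character, or a line edge, on both sides of the match."""
    # (?<!\w): not preceded by a word char (also true at start of line).
    # (?!\w):  not followed by a word char (also true at end of line).
    return re.compile(r"(?<!\w)(?:" + pattern + r")(?!\w)", re.ASCII)


print(grep_style_w(":::").search("a:::a"))      # the a:::a example from earlier in the thread
print(grep_style_w(":::").search("a ::: a"))
print(grep_style_w("fo{2}").search("football"))
print(grep_style_w("foo").search("foo"))
```

Under this rule the line `a:::a` is not a match for `:::`, because the characters on either side of the match are word characters.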

Andy Lester

Mar 6, 2017, 12:33:34 PM
to Edward Avis, ack development

On Mar 6, 2017, at 11:30 AM, Edward Avis <e...@membled.com> wrote:

> I am proposing to make the -w flag follow GNU grep, which documents its behaviour as

Weren’t you also proposing that “ack -w :::” be an error?  Or am I misremembering?

Bill Ricker

Mar 6, 2017, 1:52:23 PM
to Andy Lester, ack...@googlegroups.com


On Mon, Mar 6, 2017 at 12:57 PM, Andy Lester <an...@petdance.com> wrote:
>> that was not Ed
>
> Sorry. Thanks for clarifying.  I’ve been lost in Real Work this morning.

Paying Work is good !

>> (a) he proposed Warning as interim mitigation only
>> (b) i proposed Error instead of warning (for mitigation)
>
> In 2.x, correct?

Right

> I’m OK with pursuing these ideas, AFTER we know what exactly we’re doing with ack 3.

We were proposing it as part of the Stabilizing ack2 so we could concentrate on ack3,
but whenever

>> (c) i also suggested that since ::: can't be a true 'word' it could be a real error
>
> So what then is a “true word”?  To me these are all legit to call “ack3 -w” on.

Perldoc PerlRE says a word is m{ \b \w+ \b }x;  so that's what i mean.
( ack -w 'foo-bar' isn't even searching for a trueword but will probably work anyway.)

> foo

of course

> foo(bar.*)

matches a \w+ string so *should* be supported in ack3 but probably not in ack2 final stable

> (set|get)_user_(profile|name|records?)

('=' =~ /\w/) so no in 2, yes in 3, as above

> (foo)\1

( '1' =~ \w ) so ditto

> \w+

the very definition of true word

> \d+

\d is a subset of \w so sure

> (_+)\w+\1     # __IS_DEFINED__, _SOME_MARKER_, etc

yes  yes
  but not #DEBUG as '#' !~ /\w/
*unless* we allow a ackrc|ACKENV setting to change -w's words
(which we may need as Perl6 allows '-' in idents !)

> It seems to me that trying to define what a “word” is such that we can warn if you call ack3 -w on a non-word is a losing proposition.  But I’m open to ideas.

This is why i was leaning towards -w ::: being an error.

My basic claim is since we allow full RE ack pattern, the knowledgeable PCRE|PerlRE user can say what they really mean, and should be encouraged to do so, as -w DWIM would have to be --type dependent. We should emphasize -w is fgrep -w compatible and do no more and no less.

To force fgrep -w compatibility, we could have -w imply -Q .
Or we could error if -Q would change the RE, to force the user to explicitly say what they mean.

(Compatibility comment:  We don't support egrep patterns, we support PCRE patterns instead, so we can never claim egrep compatibility; we say 'grep' but we really mean the fgrep subset when we talk about ack being drop-in compatible. So we can define the extension from -w literal to -w $PERLRE as we like. That includes erroring with "please just say what you really mean", instead of us trying to DWIM, if $PERLRE compiles to more than one literal string. Or defining what it means over all filetypes. Or defining a way for a --type declaration to define -w per file type, uh oh, noooooo. Filetypes containing two sub-languages would never DWIM.)


(We could in _theory_ automatically add any \W chars in a given -w 'word' as temporarily word-ish, but that way lies madness.  e.g.
  ack3 -w '#DEBUG'
will look for
   qr{ (?: ^ | \b | (?<= \s | \W ) ) [#] DEBUG (?= \b | \s | $ ) }x .
   (or the \A\z equivalent; )
we add \W to the lookbehind so it will match <title>#DEBUG</title> , as we consider # temporarily as an honorary word-char. But how do we handle this in the general case? This means inspecting the compiled QR to see if it can match a leading \W,
ugh, that way lies madness.)

OR ...  we specify that we explicitly support -w $RE that matches strings with a leading or trailing \W  (†) only when the leading (and/or trailing) non-word-chars matched are expressed as literals in first (respectively last) position(s), thus all metachars internal.
    Meaning we can and will assume -w $RE
  • with leading and trailing literal word chars are good to wrap the easy way;
  • with leading and/or trailing non-word non-meta require special lookaround wrapping based on some wordbreak rule TBD possibly that above;
  • with Leading and/or trailing metachars will match truewords only (or matches a subset of  /\A\w.*\w\z/ ) and so can be wrapped with  , same as with a literal trueword
     [the (?:^|\b|$) don't need to be lookaround since zero width.]

†​(LET 'RE1 intersects with RE2' := RE1 matches some string that also matches RE2
$RE intersects with  m(\A\W.*|.*\W\z)

