regex to find where 'sample text' is not followed by 'sample text' a couple of lines down

Chris Jones

Nov 12, 2020, 6:43:07 PM
to vim...@googlegroups.com
I am proofreading a document where a few words occur on one line and the
same exact words are replicated two lines down.

Here's a sample:

| ```{=latex}
| \index{Text that must occur twice}
| ```
| **2507. Text that must occur twice.** ... etc.

I found that it's easy to highlight such occurrences using (e.g.):

| /\\index{\(.*\)}\n```\n\*\*\d\+\. \1 " (1)

Now I noticed that once in a while the repeated text is not the same as
the text inside the curly brackets (i.e. in the \index{...} command).

Trouble is... there are over three thousand such occurrences...

Rather than highlighting the instances where the two strings match
I would prefer to have vim highlight the ones where they don't.

In order to find them I tried:

| /\\index{\(.*\)}\n```\n\*\*\d\+\. \@<!\1 " (2)

The '\@<!' as I understand it means that my search pattern will match
everything up to and including the space... followed by something that
differs from the current value of the '\1' back reference.

For instance it should match/highlight something like this:

| ```{=latex}
| \index{Text that must occur twice}
| ```
| **2507. Text that must oKKur twice.** ... etc.

Unfortunately it does not.

Is this caused by a flaw in my logic or is it my misunderstanding of the
way '\@<!' works...?

What would be the right way to detect such discrepancies?

Thanks,

CJ

Tim Chase

Nov 12, 2020, 7:15:25 PM
to Chris Jones, vim...@googlegroups.com
On 2020-11-12 18:42, Chris Jones wrote:
> I am proofreading a document where a few words occur on one line
> and the same exact words are replicated two lines down.
>
> Here's a sample:
>
> | ```{=latex}
> | \index{Text that must occur twice}
> | ```
> | **2507. Text that must occur twice.** ... etc.
>
> I found that it's easy to highlight such occurrences using (e.g.):
>
> | /\\index{\(.*\)}\n```\n\*\*\d\+\. \1 " (1)
>
> Now I noticed that once in a while the repeated text is not the
> same as the text inside the curly brackets (i.e. in the \index{...}
> command).

As best I can tell, this should highlight \index{} entries that don't
match text in the following N lines (3ish here, though I might have a
fenceposting error)

/\\index{\zs\(.*\)\ze}\(\%(\n.*\)\{,3\}\1\)\@!

At least it passed all the tests I threw at it.

> In order to find them I tried:
>
> | /\\index{\(.*\)}\n```\n\*\*\d\+\. \@<!\1 " (2)
>
> The '\@<!' as I understand it means that my search pattern will
> match everything up to and including the space... followed by
> something that differs from the current value of the '\1' back
> reference.

The first issue in there is that the "\@<!" references the atom *before* it
(a space) rather than the atom *after* it (your \1). However, even if
you group them, it might not-match if off by even one character. I'd
have to play with it more to see if there are other nuances that
would cause issue.
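
As a rough illustration of what pattern (2) was aiming for, here is a
minimal Python sketch. Python's re module stands in for Vim's regex
engine here, which is an assumption of this example and not something
from the thread. It places a negative lookahead directly in front of the
back reference, so the assertion tests \1 rather than the space (in Vim
the assertion operators are written after the atom they test). The
names FENCE, PATTERN and has_mismatch are made up for the example.

```python
import re

FENCE = "`" * 3  # the line of three backticks from the sample

# Python rendering of pattern (2): match up to and including the space,
# then assert that the captured \index text does NOT follow.
PATTERN = re.compile(r"\\index\{(.*)\}\n" + FENCE + r"\n\*\*\d+\. (?!\1)")

def has_mismatch(block):
    """Return True when the text two lines down differs from the \\index text."""
    return PATTERN.search(block) is not None

good = ("\\index{Text that must occur twice}\n" + FENCE +
        "\n**2507. Text that must occur twice.** ... etc.")
bad = ("\\index{Text that must occur twice}\n" + FENCE +
       "\n**2507. Text that must oKKur twice.** ... etc.")

print(has_mismatch(good))
print(has_mismatch(bad))
```

Only the second sample should be reported, since that is the one where
the repeated text differs from what is inside the curly brackets.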

-tim



Chris Jones

Nov 14, 2020, 4:18:58 PM
to vim...@googlegroups.com
On Thu, Nov 12, 2020 at 07:15:07PM EST, Tim Chase wrote:
> On 2020-11-12 18:42, Chris Jones wrote:
> > I am proofreading a document where a few words occur on one line
> > and the same exact words are replicated two lines down.
> >
> > Here's a sample:
> >
> > | ```{=latex}
> > | \index{Text that must occur twice}
> > | ```
> > | **2507. Text that must occur twice.** ... etc.
> >
> > I found that it's easy to highlight such occurrences using (e.g.):
> >
> > | /\\index{\(.*\)}\n```\n\*\*\d\+\. \1 " (1)
> >
> > Now I noticed that once in a while the repeated text is not the
> > same as the text inside the curly brackets (i.e. in the \index{...}
> > command).
>
> As best I can tell, this should highlight \index{} entries that don't
> match text in the following N lines (3ish here, though I might have a
> fenceposting error)
>
> /\\index{\zs\(.*\)\ze}\(\%(\n.*\)\{,3\}\1\)\@!
>
> At least it passed all the tests I threw at it.

Never heard the term fenceposting... at least in this context.

I tried to figure out the intended logic behind your regex but was
unable to do so.

So I copy/pasted it and it found 3-4 errors for me... cases where the
regexes that I used to create the '\index etc. tagging from my raw text
had run into ho-hum... difficulties... and I had to do the job
manually...

I find your regex very interesting as a case study... all the more so
because it uses syntax I was not aware of... such as the '\%(...\)' bit
that, if I read the documentation correctly, defines a special kind of
subgroup... and the '\{,3\}' bit that apparently can be used to repeat
whatever matches in the preceding \(...\) - I'm guessing...

Could you explain further?

> > In order to find them I tried:
> >
> > | /\\index{\(.*\)}\n```\n\*\*\d\+\. \@<!\1 " (2)
> >
> > The '\@<!' as I understand it means that my search pattern will
> > match everything up to and including the space... followed by
> > something that differs from the current value of the '\1' back
> > reference.
>
> The first issue in there is that the "\@<!" references the atom *before* it
> (a space) rather than the atom *after* it (your \1). However, even if
> you group them, it might not-match if off by even one character. I'd
> have to play with it more to see if there are other nuances that
> would cause issue.

... and so it should. Not sure where this error in my regex crept in
since I originally copy/pasted something I found in some SE issue or
other...

Thanks,

CJ

Tim Chase

Nov 14, 2020, 8:16:38 PM
to Chris Jones, vim...@googlegroups.com
On 2020-11-14 16:18, Chris Jones wrote:
> On Thu, Nov 12, 2020 at 07:15:07PM EST, Tim Chase wrote:
> > As best I can tell, this should highlight \index{} entries that
> > don't match text in the following N lines (3ish here, though I
> > might have a fenceposting error)
> >
> > /\\index{\zs\(.*\)\ze}\(\%(\n.*\)\{,3\}\1\)\@!
> >
> > At least it passed all the tests I threw at it.
>
> Never heard the term fenceposting... at least in this context.

Fenceposting errors are off-by-one errors (etymologically stemming
from the fact that to put up N fence segments you need N+1
fenceposts). So that "\{,3\}" might include one line too many/few.
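
To make the off-by-one concern concrete, here is a small Python sketch
(Python's re module as a stand-in for Vim's engine, which is an
assumption of this example). It builds the same kind of pattern with a
configurable upper bound and tries it on the sample from the original
post, where the repeated text sits two lines below the \index line. The
names FENCE, mismatch_pattern and sample are hypothetical.

```python
import re

FENCE = "`" * 3  # the line of three backticks from the sample

def mismatch_pattern(max_lines):
    """Flag \\index{...} whose text is not repeated within max_lines lines."""
    return re.compile(r"\\index\{(.*)\}(?!(?:\n.*){0,%d}\1)" % max_lines)

sample = ("\\index{Text that must occur twice}\n" + FENCE +
          "\n**2507. Text that must occur twice.** ... etc.")

for limit in (1, 2, 3):
    flagged = mismatch_pattern(limit).search(sample) is not None
    print(limit, flagged)
```

With a bound of 1 the correct sample is flagged, because the search
never reaches the line holding the repeat; 2 is the smallest bound that
reaches it, and 3 simply looks one line further than this layout needs.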

> So I copy/pasted it and it found 3-4 errors for me... cases where
> the regexes that I used to create the '\index etc. tagging from my
> raw text had run into ho-hum... difficulties... and I had to do the
> job manually...

Yeah, it didn't catch multi-line \index{…} statements if you have
those, so you might have to go back and manually revisit any such
instances, but you should be able to find them with something like

/\\index{[^}]*$
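
For comparison, a hedged Python sketch of the same check (the re module
standing in for Vim, an assumption of this example). One difference
worth noting: a negated character class in Python also matches a
newline, so the newline has to be excluded explicitly. UNCLOSED and
text are hypothetical names.

```python
import re

# An \index{ whose closing brace is not on the same line.
UNCLOSED = re.compile(r"\\index\{[^}\n]*$", re.MULTILINE)

text = "\\index{closed on one line}\n\\index{spans\ntwo lines}\n"

hits = [m.group(0) for m in UNCLOSED.finditer(text)]
print(hits)
```

Only the second entry should be reported, since its closing brace is on
the following line.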

> I find your regex very interesting as a case study... all the more so
> because it uses syntax I was not aware of... such as the '\%(...\)'
> bit that, if I read the documentation correctly, defines a special
> kind of subgroup... and the '\{,3\}' bit that apparently can be used
> to repeat whatever matches in the preceding \(...\) - I'm guessing...

The "\%(…\)" is the same as the grouping/capturing "\(…\)"
except, well, it doesn't capture (for later reuse with either "\1",
"\2", etc, or in an expression with "submatch(2)"). I find that if I
explicitly use "\(…\)" when I want to capture and use "\%(…\)"
everywhere else, it's a lot clearer what I intended for capturing and
what is just there for grouping purposes. In retrospect, I should
have also used it on the outer grouping since I don't reuse that
capture

/\\index{\zs\(.*\)\ze}\%(\%(\n.*\)\{,3\}\1\)\@!
^ ^

So up to the first "}" finds the "\index{…}" and captures the stuff
inside for later reuse as "\1". It then has a big group

\%(…\)

that it asserts should not be findable here

\@!

Inside that "assert you can't find this following the \index{…}"
portion, it looks for "a newline followed by anything, up to three
times" followed by the stuff we captured earlier ("\1").

If it finds such a match, it's good, so the "you can't find this \@!"
assertion fails and it doesn't highlight. If it doesn't find such a
match, it's either because it's too far away (try increasing the "3"
to some further distance) or because the text is present but doesn't
match what was inside the \index{…} which could be because of typos
or because of line-breaks. I.e., if you have

\index{one two three}
one two
three

it will highlight it as a non-match because the line-breaks make the
two different.
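
The walkthrough above can be sketched in Python as follows. This is a
translation into Python's re syntax for illustration only, not Vim
syntax, and the names FENCE, MISMATCH, unmatched_index_entries and doc
are made up for the example. Where the Vim pattern uses \zs and \ze to
highlight only the text inside the braces, the sketch returns capture
group 1 instead.

```python
import re

FENCE = "`" * 3  # the line of three backticks from the sample

# \\index\{(.*)\}       find \index{...} and capture the inside as \1
# (?! ... )             assert the following cannot be found here
# (?:\n.*){0,3}\1       up to three "newline + anything", then \1 again
MISMATCH = re.compile(r"\\index\{(.*)\}(?!(?:\n.*){0,3}\1)")

def unmatched_index_entries(text):
    """Return the \\index texts that are not repeated in the next 3 lines."""
    return [m.group(1) for m in MISMATCH.finditer(text)]

doc = "\n".join([
    FENCE + "{=latex}",
    "\\index{Alpha beta}",
    FENCE,
    "**1. Alpha beta.** repeated correctly",
    FENCE + "{=latex}",
    "\\index{Gamma delta}",
    FENCE,
    "**2. Gamma dELta.** typo in the repeat",
    "\\index{one two three}",
    "one two",
    "three",
])

print(unmatched_index_entries(doc))
```

The first entry is left alone because its text reappears two lines
down; the second is reported because of the typo, and the third because
the line break splits the repeated text.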

> Could you explain further?

Hopefully the explanation above makes enough sense that you can start
experimenting with it and feel more confident in your abilities to
use it to bludgeon future problems, bending them to your will. :-D

-tim




Chris Jones

Nov 15, 2020, 4:39:34 PM
to vim...@googlegroups.com
On Sat, Nov 14, 2020 at 08:16:26PM EST, Tim Chase wrote:
> On 2020-11-14 16:18, Chris Jones wrote:

[..]

> The "\%(…\)" is the same as the grouping/capturing "\(…\)"
> except, well, it doesn't capture (for later reuse with either "\1",
> "\2", etc, or in an expression with "submatch(2)"). I find that if I
> explicitly use "\(…\)" when I want to capture and use "\%(…\)"
> everywhere else, it's a lot clearer what I intended for capturing and
> what is just there for grouping purposes. In retrospect, I should
> have also used it on the outer grouping since I don't reuse that
> capture
>
> /\\index{\zs\(.*\)\ze}\%(\%(\n.*\)\{,3\}\1\)\@!
> ^ ^
>
> So up to the first "}" finds the "\index{…}" and captures the stuff
> inside for later reuse as "\1". It then has a big group
>
> \%(…\)
>
> that it asserts should not be findable here
>
> \@!
>

In fact (apart from the fact that it works where mine did not...) your
solution is a simplification of my effort. And as an added bonus it
is more general, catching possible errors outside the targeted
string.

Now since I have in my ~/.vimrc...

| :set ignorecase
| :set smartcase

... I added the '\C' flag to detect lower/upper case discrepancies:

| \\index{\zs\(.*\)\ze}\%(\%(\n.*\)\{,3\}\1\)\@!\C
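
A rough Python analogue of why case sensitivity matters here (Python's
re module as a stand-in, an assumption of this example; re.IGNORECASE
plays the role of Vim's 'ignorecase' setting, and Python is
case-sensitive by default). PATTERN_SRC, FENCE and sample are
hypothetical names.

```python
import re

FENCE = "`" * 3  # the line of three backticks from the sample

PATTERN_SRC = r"\\index\{(.*)\}(?!(?:\n.*){0,3}\1)"

# The repeat differs from the \index text only by the capital "O".
sample = ("\\index{Text that must occur twice}\n" + FENCE +
          "\n**2507. Text that must Occur twice.** ... etc.")

hidden = re.search(PATTERN_SRC, sample, re.IGNORECASE) is None
caught = re.search(PATTERN_SRC, sample) is not None
print(hidden, caught)
```

When case is ignored, the back reference accepts "Occur" for "occur"
and the discrepancy goes unreported; the case-sensitive search flags it.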

> Inside that "assert you can't find this following the \index{…}"
> portion, it looks for "a newline followed by anything, up to three
> times" followed by the stuff we captured earlier ("\1").
>
> If it finds such a match, it's good, so the "you can't find this \@!"
> assertion fails and it doesn't highlight. If it doesn't find such a
> match, it's either because it's too far away (try increasing the "3"
> to some further distance) or because the text is present but doesn't
> match what was inside the \index{…} which could be because of typos
> or because of line-breaks. I.e., if you have
>
> \index{one two three}
> one two
> three
>
> it will highlight it as a non-match because the line-breaks make the
> two different.
>
> > Could you explain further?
>
> Hopefully the explanation above makes enough sense that you can start
> experimenting with it and feel more confident in your abilities to
> use it to bludgeon future problems, bending them to your will. :-D

Nothing hopeful about it! The explanation just works... at least where
I'm concerned. And the good thing is that since this was a real problem
that I had there's a good chance it'll stick...

I played with the proffered regex for a while and I was unable to find
anything wrong with it.

Thanks,

CJ