Problem with a regular expression in Vim

Xell Liu

Oct 2, 2012, 10:37:34 PM
to vim_use
Hi all,

Suppose this text fragment:

xxx==aaa==bbbccc==ddd==yyy

How can I match the "aaa" and "ddd" between the pair of "==" without
matching the bbbccc (or, of course, "xxx" or "yyy")? Apparently
/==\zs[^=]\{-}\ze==/ fails. However /==[^=]\{-}==/ does match the
"aaa" and "ddd" WITH the pair of "==". I got lost here.

Thanks very much!

Cheers,
Xell

Adam

Oct 2, 2012, 11:17:44 PM
to vim...@googlegroups.com
/==\zs\%(aaa\|ddd\)\ze==/ works
~Adam~



--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

Ben Fritz

Oct 2, 2012, 11:41:49 PM
to vim...@googlegroups.com
On Tuesday, October 2, 2012 9:38:20 PM UTC-5, Xell Liu wrote:
> Hi all,
>
> Suppose this text fragment:
>
> xxx==aaa==bbbccc==ddd==yyy
>
> How can I match the "aaa" and "ddd" between the pair of "==" without
> matching the bbbccc (or, of course, "xxx" or "yyy")? Apparently
> /==\zs[^=]\{-}\ze==/ fails.

For me it matches exactly what you told it to:

==\zs : after two '=' characters...
[^=] : match ANY character which is not a '='...
\{-} : ANY number of times...
\ze== : until another pair of '=' characters.

In your example, this matches aaa, bbbccc, and ddd. There is no reason I can think of to expect otherwise.

> However /==[^=]\{-}==/ does match the
> "aaa" and "ddd" WITH the pair of "==". I got lost here.

I'm guessing you noticed that hitting 'n' after searching for this pattern does not match the bbbccc in your example text. This actually surprised me a little, but I think it happens because:

1. You are searching for the "next" match.
2. The beginning of the ==bbbccc== string, which matches your pattern, is inside the current match.
3. To be useful, the search needs to start AFTER the current match, therefore your pattern is not even tried at this next position.

By setting the match end with \ze in your first pattern, you make it so the current match ends BEFORE the ==bbbccc== string, therefore allowing your pattern to be tried and matched on this text.
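
The difference between the two patterns can be reproduced outside Vim. What follows is only a sketch in Python, whose `re` module is not Vim's engine; the lookbehind/lookahead pair `(?<===)` and `(?===)` is used as a rough stand-in for `\zs` and `\ze`, which is an assumption of this example and not an exact equivalent.

```python
import re

text = "xxx==aaa==bbbccc==ddd==yyy"

# Delimiters are part of the match, so each match consumes its closing "==".
# The scan resumes after that "==", and "bbbccc" is never tried.
consuming = [(m.start(), m.group()) for m in re.finditer(r"==[^=]*==", text)]

# Delimiters are only asserted (zero-width), so the closing "==" of one
# match is still available as the opening "==" of the next one.
zero_width = [(m.start(), m.group())
              for m in re.finditer(r"(?<===)[^=]*(?===)", text)]

print(consuming)   # [(3, '==aaa=='), (16, '==ddd==')]
print(zero_width)  # [(5, 'aaa'), (10, 'bbbccc'), (18, 'ddd')]
```

The start offsets show where each scan resumed: the consuming pattern jumps from column 3 straight to column 16, while the zero-width one picks up again at column 10.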

Now, how to fix this? Well...that depends on what you actually want. Somebody suggested aaa\|ddd in your pattern, which will certainly match your example text as desired, but I have a feeling the example text is actually not all that similar to what you actually want to match.

In what way are aaa and ddd similar? How do they differ from bbbccc? Is it a matter of number of characters? If so, just replace \{-} with \{3} to match exactly three characters, or \{1,3} for 1-3 characters, or \{0,3} to more closely match what you have now. Or are you actually looking for a literal aaa or ddd?

Xell Liu

Oct 2, 2012, 11:42:40 PM
to vim...@googlegroups.com
Thanks. But "aaa" and "ddd" are merely examples for indicating that
what I want to match is the contents between a pair of "==".
Enumerating the contents doesn't meet my requirement.

Xell Liu

Oct 3, 2012, 12:05:54 AM
to vim...@googlegroups.com
Thanks very much for your detailed explanation.

In fact, by pseudocode I think I can put my requirement like this:

1. Search for the first pair of "==" from the beginning location where
the search starts.
2. Extract the contents in the pair of "==" as the first match result.
3. Disable/Invalidate/Remove the matched contents with its "==" surroundings.
4. Repeat from 1. until reaching the end of the search range.

So, is it possible to do that by regular expression? I'm not very
familiar with the concepts of greedy/non-greedy or zero-match (like
the queer things of \@! or \@<= etc.). Will they help?
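
The four steps above can be sketched without any regex at all. This is a minimal illustration in Python, not something Vim does internally; the function name `extract_between` and its `delim` parameter are made up for the example.

```python
def extract_between(text, delim="=="):
    """Collect the contents of each delimiter pair, scanning left to right."""
    results = []
    pos = 0
    while True:
        start = text.find(delim, pos)             # step 1: opening delimiter
        if start == -1:
            break
        content_start = start + len(delim)
        end = text.find(delim, content_start)     # its closing delimiter
        if end == -1:
            break
        results.append(text[content_start:end])   # step 2: extract contents
        pos = end + len(delim)                    # step 3: skip the used pair
    return results                                # step 4 is the loop itself


print(extract_between("xxx==aaa==bbbccc==ddd==yyy"))  # ['aaa', 'ddd']
```

Step 3 is the important one: because the closing "==" is skipped, it can never serve as the opening "==" of the next pair.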

Paul Isambert

Oct 3, 2012, 2:55:11 AM
to vim...@googlegroups.com
Xell Liu <xell...@gmail.com> wrote:
>
> Thanks very much for your detailed explanation.
>
> In fact, by pseudocode I think I can put my requirement like this:
>
> 1. Search for the first pair of "==" from the beginning location where
> the search starts.
> 2. Extract the contents in the pair of "==" as the first match result.
> 3. Disable/Invalidate/Remove the matched contents with its "==" surroundings.
> 4. Repeat from 1. until reaching the end of the search range.
>
> So, is it possible to do that by regular expression? I'm not very
> familiar with the concepts of greedy/non-greedy or zero-match (like
> the queer things of \@! or \@<= etc.). Will they help?

I don't know what you mean exactly by "extract" in point 2, but to me it
sounds like a simple capture (or sub-expression):

==\([^=]\+\)==

That way, the "=" signs are part of the match, but you can work on the
enclosed material only (with \1).
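
As a rough illustration of working on the enclosed material only, here is a small Python sketch (Python's `\1` / `group(1)` plays the role of Vim's `\1`; the upper-casing is just an arbitrary transformation chosen for the example).

```python
import re

text = "xxx==aaa==bbbccc==ddd==yyy"

# The whole "==...==" is matched, but only group 1 (the enclosed text) is
# transformed; the delimiters are written back unchanged.
result = re.sub(r"==([^=]+)==",
                lambda m: "==" + m.group(1).upper() + "==",
                text)

print(result)  # xxx==AAA==bbbccc==DDD==yyy
```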

Best,
Paul

Xell Liu

Oct 3, 2012, 3:10:57 AM
to vim...@googlegroups.com
Hi Paul,

By extraction I mean that what I need to match is only the contents,
without the surrounding "==" (i.e. extracting "aaa" from "==aaa==").

I certainly know that I could use \(\) to get the contents afterwards.
But I want to know how to match only the contents. For example, with
the :match command and your suggestion, the "==" outside the \(\)
would always be highlighted as well.

Chris Jones

Oct 3, 2012, 8:50:18 AM
to vim_use
If I understand the question correctly, this is similar to the textbook
problem of matching strings between double quotes:

xxx "aaa" bbbccc "ddd" yyy

If you want to match "aaa" and then "ddd", the solution is something
like:

/"[^"]*"

You need the negated character class to prevent the regex engine from
matching the entire - "aaa" bbbccc "ddd" - string in one big gulp.

Now, at least in Vim, if you want the quotes *not* to be part of the
match, you would be tempted to try:

/"\zs[^"]*\ze"

The problem is that this doesn't do what we hoped for.

The reason, I think, is that if you specify a zero-length match for your
delimiter (double quotes or a pair of equal signs), once a match is
completed, the regex engine does not consider that the ending delimiter
has been consumed, and hence, when looking for the next match, it starts
over where it left off..

So, it will match:

aaa
bbbccc .. that's space+bbbccc+space
ddd

.. which is not what you want: you only want to match aaa and ddd
because although ' bbbccc ' is immediately preceded and followed by
a double quote, it is obviously *not* a quoted string.

Have I understood you correctly..?

I'm not sure there's a way to specify a zero-length match *and* cause
the regex engine to consume the matched characters before it moves to
attempt the next match, but I wouldn't be too optimistic.. because it
feels like you would be asking the regex engine to do two things that
are contradictory to each other.

If you're doing substitutions, could a possible workaround be not to
bother about the zero-length matches in the first place and.. "add them
back" so-to-speak as part of the substitution string..?

| xxx==aaa==bbbccc==ddd==yyy
|
| :%s/==[^=]*==/==XXX==/g

.. where aaa and ddd would be replaced by XXX..

CJ

--
Focus follow mouse users will burn in hell!!!

Marvin Renich

Oct 3, 2012, 8:55:33 AM
to vim...@googlegroups.com
* Xell Liu <xell...@gmail.com> [121003 03:18]:
> Hi Paul,
>
> By extraction I mean that what I need to match is only the contents,
> without the surrounding "==" (i.e. extracting "aaa" from "==aaa==").
>
> I certainly know that I could use \(\) to get the contents afterwards.
> But I want to know how to match only the contents. For example, with
> the :match command and your suggestion, the "==" outside the \(\)
> would always be highlighted as well.
>
> >> In fact, by pseudocode I think I can put my requirement like this:
> >>
> >> 1. Search for the first pair of "==" from the beginning location where
> >> the search starts.
> >> 2. Extract the contents in the pair of "==" as the first match result.
> >> 3. Disable/Invalidate/Remove the matched contents with its "==" surroundings.
> >> 4. Repeat from 1. until reaching the end of the search range.
> >>
> >> So, is it possible to do that by regular expression? I'm not very
> >> familiar with the concepts of greedy/non-greedy or zero-match (like
> >> the queer things of \@! or \@<= etc.). Will they help?

(Please don't top-post.)

The meaning of the terms "extract" and "disable/invalidate/remove" in 2 and
3 above is unclear to me (and apparently others on this list).

It is not clear what your criteria are for what is supposed to match and
what isn't. You seem to want the effects of \zs and \ze, which exclude
the "==" from the part of the text that is considered to match (i.e.
what is highlighted if 'hls' is on, and what would be replaced with
:s//something/). But you say you want to match the "aaa" and "ddd", but
not the "bbbccc". What is not clear is what strings between pairs of
"==" you want to match.

Is it alternating matches, so that if "==" occurs seven times, you match
the first, third, and fifth inner strings, like this:

a==b==c==d==e==f==g==h

would match b, d, and f?

Do you want exactly what is done by /==\zs[^=]\{-}\ze==/ except that you
want to start looking for the next match (e.g. with the normal mode
command 'n') _after_ the trailing "==" of the previous match?

Side note: In the given regex, because you are matching [^=] between
==, there is no difference between greedy and non-greedy. The
expression could be /==\zs[^=]*\ze==/.
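
A quick way to convince yourself of this, sketched in Python (where `*?` is the non-greedy counterpart of Vim's `\{-}`): with the negated class both forms give the same matches, whereas with `.` they do not.

```python
import re

text = "xxx==aaa==bbbccc==ddd==yyy"

# With [^=], the repetition cannot run past a "=", so both forms agree.
negated_greedy = re.findall(r"==[^=]*==", text)
negated_lazy = re.findall(r"==[^=]*?==", text)

# With ".", greediness matters: the greedy form swallows everything up to
# the last "==" on the line.
dot_greedy = re.findall(r"==.*==", text)
dot_lazy = re.findall(r"==.*?==", text)

print(negated_greedy == negated_lazy)  # True
print(dot_greedy)                      # ['==aaa==bbbccc==ddd==']
print(dot_lazy)                        # ['==aaa==', '==ddd==']
```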

...Marvin

Xell Liu

Oct 3, 2012, 10:04:26 AM
to vim...@googlegroups.com
Hi all,

Sorry for the previous mail where my ambiguous expression led to a
somewhat time-wasting discussion. Thanks to everyone who tried to
help. Here is the rephrased version.

I want to use the :match command to highlight some text, which is
free-form and thus cannot be enumerated. The text is always
surrounded by a pair of "==". I need a regex to match the text.

In the following example, what I want to be highlighted is "aaa" and "bbb"

xxx==aaa==cccddd==bbb==yyy

In the following example, what I want is "a", "c", "e" and "g"

x==a==b==c==d==e==f==g==y

In a nutshell, taking the second example to illustrate, it'd be better
to consider it as "x(a)b(c)d(e)f(g)y" and then highlight the elements
between each pair of "()" but not the ones between ")" and "(" or the
"()" themselves. BTW, there is only one possible result, because you
can always start from the beginning of the line, which makes that
imaginary substitution unique.

And I know that using /==[^=]\{-}==/ or something similar gives
an _approximate_ result; however, the surrounding "==" are not
excluded as I would like.

Ben Fritz and Chris Jones correctly pointed out that \zs\ze is useless
here because it leaves the closing "==" available to open another valid
match, so every element between any two "==" ends up
highlighted.

Provided that I'd like to insist on using :match, is it possible to
write a regex to meet my requirement?

(According to Jones, due to the mechanism of regex engine, it's highly
improbable...)

Thanks.

Best,
Xell

Chris Jones

Oct 3, 2012, 10:50:29 AM
to vim...@googlegroups.com
On Wed, Oct 03, 2012 at 10:04:26AM EDT, Xell Liu wrote:

[..]

>
> (According to Jones, due to the mechanism of regex engine, it's highly
> improbable...)

Hehe.. Don't take my word for it..

Aside from attempting to clarify your initial post, I tried to restate
the problem in such terms: "Is there something in Vim's regex syntax
that would force it to consume zero-length matches?" In this particular
case, forcing the engine to jump to the next '==' delimiter instead of
what I believe is the behavior that is causing you grief.

More of a hunch than something based on a solid understanding of Vim's
regex engine, I can assure you..

I also thought that reasoning in terms of "..." double-quoted strings
might help, since it's a more common problem than strings delimited by
pairs of equal signs: provided the two problems are indeed equivalent,
it's both easier to describe and more likely to have been solved by
someone more knowledgeable than myself.

CJ

--
Hi! My name is bobby...

Ben Fritz

Oct 3, 2012, 11:37:36 AM
to vim...@googlegroups.com
On Wednesday, October 3, 2012 9:05:14 AM UTC-5, Xell Liu wrote:
> Hi all,
>
> Sorry for the previous mail where my ambiguous expression led to a
> somewhat time-wasting discussion. Thanks to everyone who tried to
> help. Here is the rephrased version.
>
> I want to use the :match command to highlight some text, which is
> free-form and thus cannot be enumerated. The text is always
> surrounded by a pair of "==". I need a regex to match the text.
>
> In the following example, what I want to be highlighted is "aaa" and "bbb"
>
> xxx==aaa==cccddd==bbb==yyy
>
> In the following example, what I want is "a", "c", "e" and "g"
>
> x==a==b==c==d==e==f==g==y

Ok...so you want all "even numbered" things surrounded by ==, correct?

So you have:

{beginning of line}{possible text not to match}=={text to match}=={text not to match}=={text to match}==...

This seems to work for me:

\%(^\|[^=]*==[^=]*==\)\@<=[^=]*==\zs[^=]*\ze==

The trick here is that I match only at positions where either the beginning of the line, or a previous non-match/match pair precede the match.

Probably this pattern could be made simpler but I wasn't able to find a simpler one quickly. The /\@<= is special because unlike \zs, as noted in the :help, "the part of the pattern after '\@<='...[is]...checked for a match first", so I couldn't just drop it and use the \zs by itself.

By the way, you still haven't said what task you're trying to accomplish, beyond that you want to use :match. If syntax highlighting with :syn match would work for you instead, probably the easiest way to highlight these would be using the full ==...== string, and pattern offsets. :he :syn-pattern-offset
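
The "even numbered" idea can also be checked independently of any regex engine. The following Python sketch splits the line on the delimiter and keeps every second piece; the function name `spans_to_highlight` is invented for the example, and it assumes the text contains no stray single "=" or runs like "===", which the Vim pattern would treat differently.

```python
def spans_to_highlight(line, delim="=="):
    """Return (start, end, text) for each piece that sits inside a pair."""
    pieces = line.split(delim)
    spans = []
    offset = 0
    for index, piece in enumerate(pieces):
        # Counting from zero, the pieces inside a pair have odd indexes
        # (the "even numbered" ones when counting from one). The last piece
        # has no closing delimiter after it, so it is never included.
        if index % 2 == 1 and index < len(pieces) - 1:
            spans.append((offset, offset + len(piece), piece))
        offset += len(piece) + len(delim)
    return spans


print(spans_to_highlight("xxx==aaa==cccddd==bbb==yyy"))
# [(5, 8, 'aaa'), (18, 21, 'bbb')]
```

The returned offsets are zero-based columns, so they can be compared directly against what :match highlights on the same line.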

Xell Liu

Oct 4, 2012, 12:07:05 PM
to vim...@googlegroups.com
Hi Fritz,

Thanks very much for the solution and, especially, for the shift in
thinking -- I hadn't noticed the "even numbered" thing. And your
practical example of using \@<= teaches me even more.

As to the "task beyond the :match", hmm, I did have something further
and off-topic to do (and I already knew about the syn-pattern-offset of
syntax highlighting). However, I just saw the regex difficulty as a pure
"technical" challenge and hoped to learn something from it. Now I have
:-)

Best,
Xell

Chris Jones

Oct 8, 2012, 2:39:32 PM
to vim...@googlegroups.com
On Wed, Oct 03, 2012 at 11:37:36AM EDT, Ben Fritz wrote:

[..]

> This seems to work for me:
>
> \%(^\|[^=]*==[^=]*==\)\@<=[^=]*==\zs[^=]*\ze==

> Probably this pattern could be made simpler..

Maybe something like this..?

| \(^[^=]*\|==[^=]*\)==\zs[^=]*

Only briefly tested since I felt the solution to what is after all
a trivial problem shouldn't be that complicated in the first place.

I researched it a little further over the weekend, and eventually, I ran
into this via a perl forum:

| % echo 'ascii string: "string1", unicode string: "κορδόνι"' | perl -wnE 'say for /"[^"]*"/g'
| "string1"
| "κορδόνι"

I don't know perl, but it looks like the match on the two sample strings
includes the quotes.

Now, if you add a capturing group¹ around the [^"]* negated character
class that matches the actual strings, this is what you get:

| % echo 'ascii string: "string1", unicode string: "κορδόνι"' | perl -wnE 'say for /"([^"]*)"/g'
| string1
| κορδόνι

This time the match does _not_ include the quotes.

Or, with our sample text:

| % echo 'xxx==aaa==bbbccc==ddd==yyy' | perl -wnE 'say for /==[^=]*==/g'
| ==aaa==
| ==ddd==
|
| % echo 'xxx==aaa==bbbccc==ddd==yyy' | perl -wnE 'say for /==([^=]*)==/g'
| aaa
| ddd

So, I tried the same approach with Vim:

| xxx==aaa==bbbccc==ddd==yyy
|
| /==[^=]*==
| /==\([^=]*\)==

But it doesn't make any difference..

Both regexes match '==aaa==' and '==ddd==' including the quotes.

Isn't Vim supposed to mimic perl regexes..?

Or is there something in Vim's regex syntax that would make it work?

CJ

¹ 'sub-expression' in Vim parlance..?

--
WHAT YOU SAY??

Ben Fritz

Oct 8, 2012, 3:53:29 PM
to vim...@googlegroups.com
On Monday, October 8, 2012 1:39:46 PM UTC-5, Chris Jones wrote:
>
> I researched it a little further over the weekend, and eventually, I ran
> into this via a perl forum:
>
> | % echo 'ascii string: "string1", unicode string: "κορδόνι"' | perl -wnE 'say for /"[^"]*"/g'
> | "string1"
> | "κορδόνι"
>
> I don't know perl, but it looks like the match on the two sample strings
> includes the quotes.
>
> Now, if you add a capturing group¹ around the [^"]* negated character
> class that matches the actual strings, this is what you get:
>
> | % echo 'ascii string: "string1", unicode string: "κορδόνι"' | perl -wnE 'say for /"([^"]*)"/g'
> | string1
> | κορδόνι
>
> This time the match does _not_ include the quotes.
>

Yes it does. The captured group, now accessible with $1, does not include the quotes. The match does include the quotes. The full match (the equivalent of which gets highlighted in Vim) is accessible in $& (in Vim: \0). But the Perl snippet given only prints out the captured group because of the /g flag. See below.

>
> Or, with our sample text:
>
> | % echo 'xxx==aaa==bbbccc==ddd==yyy' | perl -wnE 'say for /==[^=]*==/g'
> | ==aaa==
> | ==ddd==
> |
> | % echo 'xxx==aaa==bbbccc==ddd==yyy' | perl -wnE 'say for /==([^=]*)==/g'
> | aaa
> | ddd
>

In list context, if there are capturing groups, the match operator /.../g returns a list of all strings where the capture group matches.

"The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern."

http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

This Perl is really saying:

for each place where ==([^=]*)== matches, print the captured match

Unlike Perl, Vim cannot access any captured groups outside of the search or substitute command. In other words, Perl can do stuff like this:

$mystr =~ /==(.*)==/;
print "$1\n";

Vim cannot.

This is not a regex pattern thing. It's a language thing.

> So, I tried the same approach with Vim:
>
> | xxx==aaa==bbbccc==ddd==yyy
> |
> | /==[^=]*==
> | /==\([^=]*\)==
>
>
> But it doesn't make any difference..
>
> Both regexes match '==aaa==' and '==ddd==' including the quotes.
>

Yes, they both MATCH the quotes. But the capturing group only CAPTURES the text without quotes. The same is true in Perl.

Vim HIGHLIGHTS a match as if you're doing this in Perl:

print "$mystr\n" if ($mystr =~ /"([^"]*)"/);

Vim CAPTURES a group as if you're doing this in Perl:

$mystr =~ /"([^"]*)"/;
print "$1\n";

Note that the SAME pattern is used both times, but in a different way.
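
The same match-versus-capture distinction shows up in other regex libraries too. As a small sketch in Python (not Perl or Vim, so the function names are Python's own): `group(0)` is the whole match, `group(1)` is the capture, and `re.findall` switches between returning whole matches and captures depending on whether the pattern has a group.

```python
import re

text = "xxx==aaa==bbbccc==ddd==yyy"
pattern = re.compile(r"==([^=]*)==")

first = pattern.search(text)
whole_match = first.group(0)  # the full match, delimiters included
captured = first.group(1)     # only what the parentheses captured

# findall() returns the captures when the pattern has one group, and the
# whole matches when it has none.
with_group = re.findall(r"==([^=]*)==", text)
without_group = re.findall(r"==[^=]*==", text)

print(whole_match, captured)  # ==aaa== aaa
print(with_group)             # ['aaa', 'ddd']
print(without_group)          # ['==aaa==', '==ddd==']
```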

>
> Isn't Vim supposed to mimic perl regexes..?
>

Not really. Vim has its own dialect. Although Vim regex can do a lot of what Perl's can, it's not a 1-1 match.

>
> Or is there something in Vim's regex syntax that would make it work?
>

Not in the regex syntax. But as discussed it's not Perl's regex syntax allowing it to work in Perl either, it's how the regex is applied. Using the /g flag on a match operator in Perl gives you all matching substrings.

Vim can do something similar with the matchlist() function if you pass a count to it in a loop until the match fails. I'm not sure if there's a more efficient way to extract all matches or not.

Chris Jones

Oct 9, 2012, 12:43:45 AM
to vim...@googlegroups.com
On Mon, Oct 08, 2012 at 03:53:29PM EDT, Ben Fritz wrote:

[..]

Hi Ben,

Thanks much for your comments.

I did run some tests to try and figure out what that perl "g" flag was
for and wrongly concluded (shades of Vim, I guess) that it meant
something like repeat match until end-of-string.

I think the trouble about regexes is that beyond textbook cases, there
never appears to be a straightforward solution, even to trivial
problems. I remember a thread not so long ago where the OP needed to
collapse empty lines among others..

All rather frustrating.

CJ

--
HOW ARE YOU GENTLEMEN?