Regexp (?!re)

13,340 views
Skip to first unread message

and...@default-value.com

unread,
May 20, 2012, 5:21:57 PM5/20/12
to golan...@googlegroups.com
I see that (?!re) regular expression is not supported neither in re2 nor in Google Go. Is there any chance that its support will be implemented in future releases? (it is supported at least in Ruby, Python, Java, Perl)

Andrew Gerrand

unread,
May 20, 2012, 10:44:57 PM5/20/12
to and...@default-value.com, golan...@googlegroups.com
You'll need to be more specific. (?!re) doesn't mean anything to most people.

I suspect this is a feature provided by PCRE, which is why it is
available in the languages you mentioned.

Andrew

and...@default-value.com

unread,
May 20, 2012, 11:04:31 PM5/20/12
to golan...@googlegroups.com, and...@default-value.com
Sorry, I didn't know about this feature recently. But faced a problem because one piece of code cannot be ported from java due to this issue. I am referring to (?!re) before text not matching re (NOT SUPPORTED) in http://code.google.com/p/re2/wiki/Syntaxhttp://code.google.com/p/re2/wiki/Syntax.

Ian Lance Taylor

unread,
May 20, 2012, 11:22:49 PM5/20/12
to and...@default-value.com, golan...@googlegroups.com
?! is a lookahead assertion. I'm not aware of any plans to add them to
re2. To be honest I don't see much use for them when used by a
programming language like Go. They are more useful when used with
something like perl or sed that has builtin support for streaming
through the file.

Ian

Kevin Ballard

unread,
May 21, 2012, 3:13:13 PM5/21/12
to Ian Lance Taylor, and...@default-value.com, golan...@googlegroups.com
Lookahead (and, less so, lookbehind) assertions are actually extremely useful when crafting powerful regular expressions even in non-streaming contexts.

For example, if I want to use a regular expression to parse an HTML document (yes yes, bad idea, but bear with me), and I'm interested in grabbing the very first <pre> tag, I might write a regular expression that looks like

    <pre>.*</pre>

This is great, unless there's two <pre> tags in the document. Now I might adjust it to look like

    <pre>.*?</pre>

so I only get the first. But what if the <pre> tag actually has another <pre> tag nested in it? At the time I wrote this regex, perhaps this never happened, but the document was modified later. I'd rather have my regex fail then give me bad data, so how do I do that? This is actually not that hard if we have the atomic operator (?>…) (called the possessive operator in the re2 syntax), except re2 doesn't support that either, so we have to craft a very weird pattern (and we're simplifying here by assuming there's no whitespace in the tag)

    <pre>(?:<(?:[^p]|p[^r]|pr[^e]|pre[^>]|$)|[^<])*</pre>

(I make no guarantees about the correctness of the pattern)

The atomic version looks like

    <pre>(?:(?><pre>)\A|.)*</pre>

Here we atomically match <pre>, followed by a non-matching token (\A, which we know cannot match at this point). The atomic operator means that once we encounter <pre> we cannot backtrack, and thus the pattern is guaranteed to fail.

The version with the lookahead operator is rather trivial

    <pre>(?:(?!<pre>).)*</pre>

Here we just test for a lookahead on <pre> before consuming each character.

This is actually a real-world example. I have some source code right now that pulls a document from the web and parses out the content of the first <pre> tag. I tried using an HTML parser but the go-html-transform package can't handle this document (filed as issue #5), and I'm not building Go from source so I don't have the exp/html package. Luckily the document in question has a very simple structure that hasn't changed in years, so I don't feel so bad about using string parsing. However, I would like to be able to have my regular expression error out if the document does, in fact, change in structure, but I don't want to introduce the unreadable mess that is the 3rd pattern into my code.

-Kevin

Kyle Lemons

unread,
May 21, 2012, 3:25:23 PM5/21/12
to Kevin Ballard, Ian Lance Taylor, and...@default-value.com, golan...@googlegroups.com
I find that such constructs only become necessary when I am trying to use a regular expression when I should be using a parser.

 
-Kevin

Ian Lance Taylor

unread,
May 21, 2012, 4:19:26 PM5/21/12
to Kevin Ballard, and...@default-value.com, golan...@googlegroups.com
Kevin Ballard <kbal...@gmail.com> writes:

> For example, if I want to use a regular expression to parse an HTML
> document (yes yes, bad idea, but bear with me), and I'm interested in
> grabbing the very first <pre> tag, I might write a regular expression
> that looks like

Yes, you might, but then you said right from the start that it is a bad
idea. And it is.

And if you do really need it, you could write a Go function to translate
your preferred regexp into a regexp that re2 will accept. Or you could
use two regexps. Or you could use a loop.

So your example is not a terrible example, but I don't find it to be a
convincing one either.

Not that I'm particularly opposed to adding (?!re) to Go's regexp
package if somebody wants to do the work. I don't know Russ's opinion.

Ian

Kevin Ballard

unread,
May 21, 2012, 4:47:21 PM5/21/12
to Ian Lance Taylor, and...@default-value.com, golan...@googlegroups.com
I picked this example because it's the most recent regexp I've written. Yes, what I probably should do is stick with the <pre>.*?</pre> regexp and then just double-check the resulting string to see if <pre> shows up anywhere inside. But the point I was trying to make is that (?!) (and the other variations) allow for more powerful regular expressions that are useful in contexts beyond streaming.

Sure, people abuse regular expressions to do things that would be better-served with real lexers. But providing or withholding various bits of "standard" re syntax isn't going to change that.

Based on your original assertion about the utility of lookaheads being more useful in streaming contexts, what if I want to write a sed-alike in Go? Or really any program where you get user-provided regular expressions, then lookbehind/lookahead assertions are useful.

In any case, I don't have a problem as long as nobody is actively opposed to adding lookahead/lookbehind assertions if someone finds the time to implement them.

-Kevin

Russ Cox

unread,
May 21, 2012, 5:14:59 PM5/21/12
to Kevin Ballard, Ian Lance Taylor, and...@default-value.com, golan...@googlegroups.com
On Mon, May 21, 2012 at 4:47 PM, Kevin Ballard <kbal...@gmail.com> wrote:
> In any case, I don't have a problem as long as nobody is actively opposed to
> adding lookahead/lookbehind assertions if someone finds the time to
> implement them.

The lack of generalized assertions, like the lack of backreferences,
is not a statement on our part about regular expression style. It is
a consequence of not knowing how to implement them efficiently. If
you can implement them while preserving the guarantees made by the
current package regexp, namely that it makes a single scan over the
input and runs in O(n) time, then I would be happy to review and
approve that CL. However, I have pondered how to do this for five
years, off and on, and gotten nowhere.

Russ

cgea...@gmail.com

unread,
Jun 16, 2013, 10:35:29 AM6/16/13
to golan...@googlegroups.com, and...@default-value.com
How about validating static text, like Forsyth-Edwards Notation for a chessboard? A look-behind would be nice to make sure that two numbers aren't adjacent in the position string.

Metal3d

unread,
Aug 19, 2013, 1:19:03 PM8/19/13
to golan...@googlegroups.com, Kevin Ballard, Ian Lance Taylor, and...@default-value.com, r...@golang.org
I disagree with Russ about the invoked reason to not implement "negative". I understand the fact that optimisation is a priority, but while there is probably no possibility to get result in O(n) time this is strongly problematic to not implement that kind of possibility.
I'm currently implementing the "VerbalExpression" system in Go, and I will be not able to implement "Not()" function unlike other languages. Perl, PHP, Haskell, Javascript, Java, C++ etc... will be able to implement the method that use "(?!XXX)" syntax (because they use PCRE library)
If Go was made to create very low level programs, having regexp and/or methods to easilly find patterns has no place. Working with lexers and parsers is the way to go. 
But Go is (stop me if I'm wrong) a language made to create client/server programs. Regexp is not the best way to parse text, but for some high level implementation it's very effective. But if we haven't full basics methods... developpers who are like me accustomed to using PCRE patterns will be severely handicapped.

Believe me, Go software developers have no problem with that "?!" does not respond in O(n) time from the moment the method works well...

Aram Hăvărneanu

unread,
Aug 19, 2013, 1:25:25 PM8/19/13
to Metal3d, golang-nuts, Kevin Ballard, Ian Lance Taylor, and...@default-value.com, Russ Cox
> Believe me, Go software developers have no problem with that "?!" does not
> respond in O(n) time from the moment the method works well...

As a Go developer I refute your assumption.

--
Aram Hăvărneanu

Didier Spezia

unread,
Aug 19, 2013, 1:35:04 PM8/19/13
to golan...@googlegroups.com, Kevin Ballard, Ian Lance Taylor, and...@default-value.com, r...@golang.org

>> developpers who are like me accustomed to using PCRE patterns will be severely handicapped.

Well, if you really want to stick to PCRE, you can still use it from Go.
Didier.

Metal3d

unread,
Aug 19, 2013, 6:58:52 PM8/19/13
to golan...@googlegroups.com, Metal3d, Kevin Ballard, Ian Lance Taylor, and...@default-value.com, Russ Cox
@aram Are you Go developper or Go Software developper, because the whole developpers I know (and other langage) prefer to have a method that works well instead of no method only because it doesn't respond in O(n) time. When you cannot do better, a working method (well designed) is better than nothing.

@Didier I know this library, but having multiple dependencies for VerbalExpression is a pitty. Everything works with built-in excepting the "Not()" method only because there is no way to have O(n) time on negative search... and as far as I know, there is no way to get this performance on that kind of search. If the algorithmic time is the only limit... there will be a lot of code that should be removed from a lot of libraries.

I really insist, my speak is not agressive at all.

Paddy Foran

unread,
Aug 19, 2013, 7:02:05 PM8/19/13
to Aram Hăvărneanu, golang-nuts
You must not be a True Scotsman, then.



--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ian Lance Taylor

unread,
Aug 19, 2013, 7:14:54 PM8/19/13
to Metal3d, golang-nuts, Kevin Ballard, and...@default-value.com, Russ Cox
On Mon, Aug 19, 2013 at 3:58 PM, Metal3d <met...@gmail.com> wrote:
> @aram Are you Go developper or Go Software developper, because the whole
> developpers I know (and other langage) prefer to have a method that works
> well instead of no method only because it doesn't respond in O(n) time. When
> you cannot do better, a working method (well designed) is better than
> nothing.

I think Russ said pretty clearly over a year ago the requirements that
we want for the regexp package and hence the kind of patch that would
be accepted. I don't see any reason to change those requirements.

In general the Go language and libraries do not add features merely
because people find them useful. The features must in addition be
well-designed and fit in well with other aspects of the language and
libraries.

Ian

Metal3d

unread,
Aug 19, 2013, 7:46:09 PM8/19/13
to golan...@googlegroups.com, Metal3d, Kevin Ballard, and...@default-value.com, Russ Cox
@Ian What I meat is that the problem is the invoked reason: 
"The lack of generalized assertions, like the lack of backreferences, is not a statement on our part about regular expression style.  It is a consequence of not knowing how to implement them efficiently."

This sentence is like "we've did it, that worked ,but we decided to deprive yourself because we do not know the best way to do it"... This is the taste it leaves.

It's like a car seller saying "ok, I give you this Porche but you don't have weels, because the wheels we've got was not fully efficients". Yes, it's a bit exaggerated :) But backreferences and negative tests on regexp are really usefull... 

I'd like to try to develop those assertions... but the regexp code is a bit complicated (for me) and I have not the time to spend a lot if time on this. Add the fact I'm not a very good programmer, and you will understand why I will not be able to help ;)

Jesse McNelis

unread,
Aug 19, 2013, 7:50:51 PM8/19/13
to Metal3d, golang-nuts, Kevin Ballard, and...@default-value.com, Russ Cox
On Tue, Aug 20, 2013 at 9:46 AM, Metal3d <met...@gmail.com> wrote:
It's like a car seller saying "ok, I give you this Porche but you don't have weels, because the wheels we've got was not fully efficients". Yes, it's a bit exaggerated :) But backreferences and negative tests on regexp are really usefull... 

A regular expression that doesn't complete in O(n) time isn't a regular expression.
If you want to parse non-regular grammars then you want a different kind of parser. 

It's like a car seller selling you a car and you're complaining that it doesn't have wings.

Ian Lance Taylor

unread,
Aug 19, 2013, 8:00:26 PM8/19/13
to Metal3d, golang-nuts, Kevin Ballard, and...@default-value.com, Russ Cox
On Mon, Aug 19, 2013 at 4:46 PM, Metal3d <met...@gmail.com> wrote:
> @Ian What I meat is that the problem is the invoked reason:
> "The lack of generalized assertions, like the lack of backreferences, is not
> a statement on our part about regular expression style. It is a consequence
> of not knowing how to implement them efficiently."
>
> This sentence is like "we've did it, that worked ,but we decided to deprive
> yourself because we do not know the best way to do it"... This is the taste
> it leaves.
>
> It's like a car seller saying "ok, I give you this Porche but you don't have
> weels, because the wheels we've got was not fully efficients". Yes, it's a
> bit exaggerated :) But backreferences and negative tests on regexp are
> really usefull...

Sorry if it gives you that impression. The real reason is that the
current regexp package gives you a guarantee: it will make a single
scan over the input and run in O(n) time, where n is the size of the
input. To see why that is important, please see Russ's regexp
articles at http://swtch.com/~rsc/regexp/ . We aren't going to break
that guarantee in order to provide some feature, because that would
mean that Go's regexp implementation has the same serious problem as
PCRE: unexpectedly bad performance for some very simple cases that
arise naturally.

Ian

Kyle Lemons

unread,
Aug 19, 2013, 8:00:57 PM8/19/13
to Jesse McNelis, Metal3d, golang-nuts, Kevin Ballard, Andrey Romanov, Russ Cox
Instead of making a *regexp.Regexp, make your own type that includes a "Negate bool" field.  Then you can reimplement the methods you want and include the boolean inversion.
But wings are so useful!  Airplanes have them, why can't my car? 

Metal3d

unread,
Aug 19, 2013, 8:13:10 PM8/19/13
to golan...@googlegroups.com, Metal3d, Kevin Ballard, and...@default-value.com, Russ Cox
Le mardi 20 août 2013 02:00:26 UTC+2, Ian Lance Taylor a écrit :
On Mon, Aug 19, 2013 at 4:46 PM, Metal3d <met...@gmail.com> wrote:
> @Ian What I meat is that the problem is the invoked reason:
> "The lack of generalized assertions, like the lack of backreferences, is not
> a statement on our part about regular expression style.  It is a consequence
> of not knowing how to implement them efficiently."
>
> This sentence is like "we've did it, that worked ,but we decided to deprive
> yourself because we do not know the best way to do it"... This is the taste
> it leaves.
>
> It's like a car seller saying "ok, I give you this Porche but you don't have
> weels, because the wheels we've got was not fully efficients". Yes, it's a
> bit exaggerated :) But backreferences and negative tests on regexp are
> really usefull...

Sorry if it gives you that impression.  

No problem :)
 
The real reason is that the
current regexp package gives you a guarantee: it will make a single
scan over the input and run in O(n) time, where n is the size of the
input.  

Ok, I didn't understood that regexp has those garanties. I though that O(n) was a recommandation, not a rule
 
To see why that is important, please see Russ's regexp
articles at http://swtch.com/~rsc/regexp/ .  We aren't going to break
that guarantee in order to provide some feature, because that would
mean that Go's regexp implementation has the same serious problem as
PCRE: unexpectedly bad performance for some very simple cases that
arise naturally.

That's right that PCRE have sometimes some bad preformances for that features. I tried to implement "not" system (in a separated package to test), I only had O(n²) and maybe another with O(n*log(n))... no way to get O(n) AFAIK
 

Ian



Kevin Gillette

unread,
Aug 19, 2013, 11:33:53 PM8/19/13
to golan...@googlegroups.com, Kevin Ballard, Ian Lance Taylor, and...@default-value.com, r...@golang.org
On Monday, August 19, 2013 11:19:03 AM UTC-6, Metal3d wrote:
Believe me, Go software developers have no problem with that "?!" does not respond in O(n) time from the moment the method works well...

I have a very serious problem with any exponential time construct making it into the stdlib regexp package, both on a theoretical basis (for one thing, anything with such an assertion not actually "regular" expression) and because some safe usecases for the regexp package would become security holes if there exists a pattern string that could take more than linear time.

If you want PCRE features, you are certainly free to either: 1) fork the regexp package and add those features for your own use (or find someone that already has done so), or 2) use cgo to link in libpcre.

Metal3d

unread,
Aug 20, 2013, 12:26:24 PM8/20/13
to golan...@googlegroups.com, Jesse McNelis, Metal3d, Kevin Ballard, Andrey Romanov, Russ Cox


Le mardi 20 août 2013 02:00:57 UTC+2, Kyle Lemons a écrit :
Instead of making a *regexp.Regexp, make your own type that includes a "Negate bool" field.  Then you can reimplement the methods you want and include the boolean inversion.


That's a way I try to follow since 2-3 days. But I really don't have any idea on how to implement a efficient method at this time
 

Kyle Lemons

unread,
Aug 20, 2013, 1:52:41 PM8/20/13
to Metal3d, golang-nuts, Jesse McNelis, Kevin Ballard, Andrey Romanov, Russ Cox
Then fall back to one of the libpcre variants?


--

叶方

unread,
Apr 2, 2016, 10:29:05 PM4/2/16
to golang-nuts, kbal...@gmail.com, ia...@google.com, and...@default-value.com, r...@golang.org

 +1  

I took more than one hour to debug my code which contains zero-length assertions, finally find this thread and know the cause is Golang doesn't support them ...

In a very sort of pragmatic sense, Golang should provide the feature and leave the choice to developers.

在 2013年8月20日星期二 UTC+8上午1:19:03,Metal3d写道:

Douglas Clark

unread,
Apr 4, 2016, 7:13:35 PM4/4/16
to golang-nuts
If you really need lookarounds you could use this regexp engine I ported to Go: http://github.com/dlclark/regexp2
However, I would advise caution as there are benefits to the re2-based engine that are not provided by more full featured engines with lookarounds. If you have the option then stick with the stdlib.
Reply all
Reply to author
Forward
0 new messages