Alternate implementations of regular expression matching?

309 views
Skip to first unread message

Bob Alexander

unread,
Apr 1, 2021, 3:54:35 PM4/1/21
to golang-nuts
I have a use of regular expressions where I need the "atomic" matching features that I can get for several other languages, such as atomic matching groups and "possessive" quantifiers. I'm willing to forego the blazing performance of Go's standard library dfa matcher for this particular case. I cannot write an equivalent regexp for Go's "regexp", and the additional code I would have to write for equivalent functionality is prohibitive.

I did find such an implementation in the ether, importable as "github.com/h2so5/goback/regexp", It kind of works, but fails for my use case.

Is there any reason to believe that downloaded modules such as this are well-tested and reliable?

To pursue this problem,, I wrote small programs that perform the same operation in isolation in some other languages that have regular expression implementations the have the features I want, in particular
   Java, whose java.util.regex module provides the desired features
   Python, which has a module "regex" downloadable from the PyPI (Python program index) that replaces the standard "re" module but has my desired features
   Also Go, using the downloaded module I mentioned

Only the Go version fails. Of course, it's the fault of the downloaded library, not Go itself :)

Anyone know of any other Go nfa implementations with the aforementioned features?

As much as I appreciate the advantages Go's dfa "regexp" module, there are times when a more full-featured nfa implementation is useful. (And, a well-written regexp for nfa can be pretty fast if the aforementioned features are used to prevent excessive backtracking.)

If anyone is interested in looking into the failure I experienced, I'd be happy to send the little programs I wrote to investigate this.


Brian Candler

unread,
Apr 1, 2021, 4:21:14 PM4/1/21
to golang-nuts
If you can live with cgo dependencies, then you might want to look at a go wrapper around PCRE, which is the "go-to" implementation for fully-featured regexps.

e.g. https://github.com/rubrikinc/go-pcre  (untested by me, and it's one of several forks)

Amnon

unread,
Apr 1, 2021, 6:36:17 PM4/1/21
to golang-nuts
Worth mentioning Russ Cox's series of blog posts on Regular Expressions 
and their implementation https://swtch.com/~rsc/regexp/

Marcin Romaszewicz

unread,
Apr 1, 2021, 6:49:48 PM4/1/21
to Amnon, golang-nuts
I have nothing useful to contribute to this discussion, however, I've been doing this long enough to remember Jamie Zawinski's quote about regular expressions :)

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/bb9db3d5-f9da-4324-80bf-48fbcc43b29en%40googlegroups.com.

Doug

unread,
Apr 1, 2021, 8:43:00 PM4/1/21
to golang-nuts
You can try my regexp2 engine that's based on the .NET engine: https://github.com/dlclark/regexp2

It supports atomic groups, has no dependencies, and doesn't need cgo. Unfortunately it doesn't support Possessive Quantifier syntax, but supposedly that may be serving as syntactic sugar for atomic groups and look-ahead assertions (which it does support).

Thanks,
Doug

Amnon

unread,
Apr 2, 2021, 7:12:42 AM4/2/21
to golang-nuts
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

I would tend to agree with this statement. Regular expressions are much abused and over-used in situations where 
simpler parsing techniques would be more appropriate. Beyond a certain length regular expressions are hard to read,
and easy to get wrong.

But when you do need to use a regular expression, then it the implementation in the standard library is usually a good choice.
Its most important feature is O(n) runtime - runtime proportional to the input length, irrespective of the complexity of 
the regular expression. This prevents denial of service attacks based on sending regular expression which cause 
catastrophic backtracking and exponential runtime. The standard library omits some of the bells and whistles 
provided by PCRE and other libraries - but often for the good reason that these require backtracking, so are impossible
to implement while guaranteeing reasonable runtimes. 

So I would certainly consider using the standard library version, and rewriting your regular expressions to remove the use
of features not supported in the standard library.

bobj...@gmail.com

unread,
Apr 3, 2021, 12:30:54 PM4/3/21
to golang-nuts
In *my* ideal world, both O(n) and backtracking implementations would be available in the standard library.

Most popular languages have only backtracking versions in their standard libraries (Java, Python, Ruby). It's nice that Go has an O(n) implementation. But, some regexes that can be written in the modern backtracking implementations that support "atomic groups", etc. just cannot be rewritten for O(n) implementations, sometimes resulting in having to write a lot of code to achieve the equivalent result. The reason that O(n) implementations lack those features is that they simply can't be done without violating O(n). So why not let the programmer exercise the choice -- O(n) where possible for denial-of-service safety or maybe better performance, and backtracking when O(n) can't be done or maximum performance is not important, and regexes incoming from the wild do not happen.

And... I suspect that well written regexes for backtracking can be nearly as performant if backtracking is kept under control via skillful use of those atomic features. It's the sloppy ones that are the problem.

Artur Vianna

unread,
Apr 3, 2021, 12:49:56 PM4/3/21
to bobj...@gmail.com, golang-nuts
An interesting approach would be the implementation of a regex engine that has optimizations just like a compiler (i think the stdlib one does that to an extent). Using powerset and hopcroft's where it can be used and leaving backtracks where it can't.

That would be fun to implement but very confusing for the user and has the same security problems of a NFA backtracking regex, but the user would be able to choose the backtracking features, being aware that they pose security problems.

Powerset algorithm is O(n²) too, so while converting the NFA to DFA some strings like (a*b*)^n can take forever basically, a possible security problem if your regex came from user input.

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Artur Vianna

unread,
Apr 3, 2021, 12:53:22 PM4/3/21
to bobj...@gmail.com, golang-nuts
Oops sorry, Powerset algo is actually O(n*2^n) time complexity
Reply all
Reply to author
Forward
0 new messages