Speeding up multiple regexp matches

2,625 views
Skip to first unread message

Raj

unread,
Jul 27, 2016, 4:11:35 AM7/27/16
to golang-nuts
Not specific to golang. Incidentally I am using go to develop this particular tool.

I have a piece of text that I want to test against a large number of regular expressions, where a different action is taken based on which regexps successfully
matched. The naive approach is to loop through each regexp and if matches do the corresponding action and continue.

I thought of combining all of the regular expressions into one massive regexp, and let the regexp state machine do all the discriminating. The problem with
this is that it gives you no way to determine which regexps were the matched among all.

Any suggestions?


Val

unread,
Jul 27, 2016, 5:05:37 AM7/27/16
to golang-nuts
How many is a large number M of regexps?
Are the regexps complex, are you trying to capture groups (parenthesized subexpressions)?
Are the regexps long, and are they similar?

It may be the case that the naive approach is already the most efficient for your use case.
It is also possible that the combined massive regexp would suffer from exponential growth in memory, though I'm not sure about that.

If you were interested in only the first matching regexp, you could use binary search, for a total cost of log(M) massive regexp tests.

This problem sounds very interesting, please share sample data so we can consider sensible approaches and run benchmarks.
Cheers
Val

Raj

unread,
Jul 27, 2016, 5:57:15 AM7/27/16
to golang-nuts
  • There are more than 10,000 regexps.
  • I need to do this task on ~ 800 million strings having max length 1KB each, and this number growing daily.
  • Currently regular expressions are simple. It's actually what MS SQL Server like operator supports. i.e. "_%[]^". See https://msdn.microsoft.com/en-us/library/ms179859.aspx
  • I am interested in getting set of all matched regexps.

Raj

unread,
Jul 27, 2016, 6:02:58 AM7/27/16
to golang-nuts
Is it possible to build an NFA/DFA from a list of regular expressions which can give the desired result i.e. list of regexps which matched the input string?
Are there any existing libs for this? Do anybody konwof any existing algos for it, which I can use to implement it?

Egon

unread,
Jul 27, 2016, 6:31:55 AM7/27/16
to golang-nuts
If the regexps aren't dynamic then Ragel (http://www.colm.net/open-source/ragel/).
There's https://github.com/BurntSushi/rure-go, which might work better than Go-s regexp.

There might be interesting approaches depending on the regexp and input data.

Also, show example data and some regexps that are representative of the data you are running it on.

+ Egon

Ganbold Tsagaankhuu

unread,
Jul 27, 2016, 6:39:50 AM7/27/16
to Egon, golang-nuts
Maybe try Hyperscan?


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andy Balholm

unread,
Jul 27, 2016, 12:43:44 PM7/27/16
to Raj, golang-nuts
in github.com/andybalholm/redwood, one thing I do is to check each URL against a (potentially very large) set of regular expressions. Since these regular expressions generally contain fairly significant amounts of literal text, I analyze each regular expression to see if I can find a set of strings such that every URL that matches that regular expression must contain at least one of those strings. (This is done in the file restring.go.) I combine the sets of strings from all the regular expressions, and do an Aho-Corasick string search based on that list of strings. From the results of the string search, I know which regexps are possible matches, and I test those against the URL.

This is all wrapped up in the regexMap type in url.go.

Andy

Caleb Spare

unread,
Jul 27, 2016, 12:59:32 PM7/27/16
to Andy Balholm, Raj, golang-nuts
On Wed, Jul 27, 2016 at 9:42 AM, Andy Balholm <andyb...@gmail.com> wrote:
in github.com/andybalholm/redwood, one thing I do is to check each URL against a (potentially very large) set of regular expressions. Since these regular expressions generally contain fairly significant amounts of literal text, I analyze each regular expression to see if I can find a set of strings such that every URL that matches that regular expression must contain at least one of those strings. (This is done in the file restring.go.) I combine the sets of strings from all the regular expressions, and do an Aho-Corasick string search based on that list of strings. From the results of the string search, I know which regexps are possible matches, and I test those against the URL.

​By coincidence, I solved this exact problem as well earlier this week​, and my code is almost the same as yours (and rsc's) -- especially the deconstruction of the syntax.Regex AST.

I used Rabin-Karp for the search. Thanks for the pointer to Aho-Corasick; that's a very nice algorithm.

-Caleb

Raj

unread,
Jul 28, 2016, 2:18:29 AM7/28/16
to golang-nuts, andyb...@gmail.com, rajenderre...@gmail.com
Thank you for excellent suggestions. I will explore and try them.


I am wondering is it possible to single NFA/DFA from set of regular expressions (similar to aho-corasick which does for fixed strings) which can determine matched regexps with single pass on the input. Especially if regular expressions include following meta characters only.

.              any character
.*             any string including empty string
[xyz]          character class
[^xyz]         negated character class


Tamás Gulácsi

unread,
Jul 28, 2016, 3:32:41 AM7/28/16
to golang-nuts
Just concatenate them with |.

Raj

unread,
Jul 28, 2016, 3:54:27 AM7/28/16
to golang-nuts
But that won't give me list of all matched regexps. No?

On Thursday, July 28, 2016 at 12:32:41 AM UTC-7, Tamás Gulácsi wrote:
Just concatenate them with |.

Tamás Gulácsi

unread,
Jul 28, 2016, 8:35:13 AM7/28/16
to golang-nuts
Use Submatch, concatenate the () groups, count inner opening parenthesis to map Submatch index to matching subexpression.

Val

unread,
Jul 28, 2016, 9:30:09 AM7/28/16
to golang-nuts
Hello Tamás,
would you provide an sample impl in playground,  e.g.   combining patterns "ab", "ac", "bc" into "(ab)|(ac)|(bc)"  and then matching input string "zabc" ?
We're expecting something like [true, false, true], or ["ab", "bc"], or equivalent.

I tried but got stuck because  :
- returned indexes denote positions in the matched string, not in the regexp,
- Submatch methods (without All) consider only the leftmost match,
- AllSubmatch method consider only non-overlapping matches, which defeats the purpose of matching the same input fragment 10000 times.

Using concatenation to build an uber-regexp (10000x longer) makes perfect sense, I think.  On the other hand, I don't contemplate concatenating 10000 times the input string (wouldn't be more efficient than the naive loop approach). Also, we won't try to concatenate 800 millions of something into 1 single string.

Tamás Gulácsi

unread,
Jul 28, 2016, 10:18:40 AM7/28/16
to golang-nuts
You're right: the non-overlapping feature disables this approach.
Then only a possible thought remains: group the regexps by their starts, like a true, and match different branches by concatenation, but differentiate with loops.
I don't know whether this is feasible or sane at all, sorry.

gtow...@gmail.com

unread,
Jul 28, 2016, 1:07:47 PM7/28/16
to golang-nuts
A couple of years ago I worked on a project, in Go, to parse multiple regular expressions into a single DFA (state machine).
There's a demo website at 
The source code for everything (website + underlying library) is on GitHub at

Gregg Townsend
University of Arizona (retired)
Reply all
Reply to author
Forward
0 new messages