[re2-dev] multipattern expression, binary stream for text

175 views
Skip to first unread message

J On

unread,
Apr 21, 2010, 3:09:05 PM4/21/10
to re2-dev
Inspired by Russ's original NFA article, I started down the path of
writing my own regex engine. I'm worried this means that now I have
three problems, and that's one too many.

What I want to do is enable multi-pattern searching of large binary
files. I want to look for "(expr1)|(expr2)|(expr3)|(expr4)...", where
expr1 can still be a full regexp, sans submatching. I expect that the
number of patterns will range in the order of 10^2-10^5, that most
will be fixed-size if not fixed-string, a few will use more aggressive
operators, and matches could be MB in size, but often simply word-
length. Am I barking up the right tree? Should I just specify the
Latin1 option and be off to the races? Will RE2 report overlapping
matches?

A point in the right direction would be much appreciated. RE2 does so
much of what I want to do, but my use-case is a bit weird so it seems
wise to ask.

Thanks,

Jon


--
Subscription settings: http://groups.google.com/group/re2-dev/subscribe?hl=en

Srinivasan Venkatachary

unread,
Apr 21, 2010, 5:01:32 PM4/21/10
to J On, re2-dev
Take a look at FilteredRE2. It allows you to have 1000s of patterns and search them efficiently by pulling out guard strings from each.

Cheenu

J On

unread,
Apr 22, 2010, 10:01:02 AM4/22/10
to re2-dev
Thanks! Going through the code, this is my understanding of how to use
it (playing fast with syntax):

vector<string> atoms;
FilteredRE2 f;

// add the specified patterns to f
for_each(patterns.begin(), patterns.end(), bind(addPatternToFilter,
&f, _1));

// compile f; get the fixed-string factors back out
f.Compile(&atoms);

vector<int> responsiveFactors;

// my own Aho-Corasick, Commentz-Walter, etc
multiStringSearch(text, atoms, responsiveFactors);

// responsiveFactors contains indices of strings found in atoms

if (!responsiveFactors.empty()) {
vector<int> matchPatterns;
f.AllMatches(text, responsiveFactors, &matchPatterns); // pass in
responsiveFactors, right?
}

There are a couple things I see with this.

The first is that I don't see a good way of getting at the matches
themselves (i.e. pairs of position & length). Could I do that by
adding another search function to FilteredRE2 that uses
RE2::FindAndConsume, with an Arg array passed in?

Second, what it looks like this is doing is running only those
patterns that had hits on the factors (good), but running the
associated regexps separately in multiple passes (less good). There
also doesn't seem to be a way to pass in information about where the
factors were found. Would there be a good way to combine the
implicated regexps into a single sub-matching NFA and then making a
single pass? If you were to take a S?WAG, would that be better or
worse than taking multiple passes with the different patterns?

Many, many thanks,


Jon

Srinivasan Venkatachary

unread,
Apr 23, 2010, 3:44:47 PM4/23/10
to J On, re2-dev
The steps you point out for using FilteredRE2 are correct, and also your observation about how it searches each implicated pattern.

Will let Russ respond to the question about RE2::FindAndConsume, as I am not familiar with that part.

thanks,
Cheenu

Russ Cox

unread,
Apr 30, 2010, 2:10:34 AM4/30/10
to J On, re2-dev
> The first is that I don't see a good way of getting at the matches
> themselves (i.e. pairs of position & length). Could I do that by
> adding another search function to FilteredRE2 that uses
> RE2::FindAndConsume, with an Arg array passed in?

Yes, though you'd probably be better off calling Match.

> Second, what it looks like this is doing is running only those
> patterns that had hits on the factors (good), but running the
> associated regexps separately in multiple passes (less good). There
> also doesn't seem to be a way to pass in information about where the
> factors were found. Would there be a good way to combine the
> implicated regexps into a single sub-matching NFA and then making a
> single pass? If you were to take a S?WAG, would that be better or
> worse than taking multiple passes with the different patterns?

It depends quite a bit on the exact problem and context.
The position information would only be useful in some cases,
and analyzing to figure out which ones they are might take
longer than just searching the whole input string.

Russ
Reply all
Reply to author
Forward
0 new messages