RE2::Set

Russ Cox

unread,

Jul 16, 2010, 2:36:07 PM7/16/10

to re2-dev

I don't know if anyone's still on the list after last night's mail storm,
but if so, there's an interesting new class in the re2 library.

RE2::Set lets you build a set of regexps and then run over an
input text just once and identify all the regexps that matched.
It does pretty aggressive factoring of common prefixes among
adjacent regexps, so if you have particularly complicated but
often similar regexps, it's worth sorting them before passing
them to the Set to bring regexps with common prefixes together.

http://code.google.com/p/re2/source/browse/re2/set.h

Russ

brendan

unread,

Jul 22, 2010, 11:58:04 AM7/22/10

to re2-dev

I was playing with this yesterday and it seems like a good first step
towards being able to use RE2 as a lexer, however, I had a few
problems with it.

1. It appears to scan the entire string even if it's anchored and
there are no possible matches after the first few characters.
2. Possibly because of the above, I can't pass in a StringPiece to
SearchDFA to have it only return the matched characters.
3. If I change the match (inside Set::Match) to be kLongestMatch, it
no longer tells me which alternative of the DFA actually matched.

Ideally, what I'd like is to have a function that operates like
Consume which finds the longest match, updates the text to point to
everything past the match, and returns the index of the matched
regular expression in the set.

Is this possible with the current code?

Brendan

Russ Cox

unread,

Jul 22, 2010, 5:49:06 PM7/22/10

to brendan, re2-dev

Yes, the fact that it doesn't stop once all the
possible matches are over is definitely a bug.
Please file one on the issue tracker.
Unfortunately, it's more of a design bug and
not at all easy to fix.

Russ

Ephraim Dan

unread,

Sep 16, 2010, 9:49:03 AM9/16/10

to re2-dev

I am evaluating re2 and only just got started looking at it, but this
"Set" functionality is important to me. I have a few questions on it:

Do I assume that matching a "Set" will only perform a single pass on
the input for all the regexps, and would presumably be significantly
faster than running all the regexps serially on the input?

Is the "Match" method a "FullMatch" or "PartialMatch"? Why doesn't it
have both like the regexps themselves do?

Are "Set"s thread-safe to the same extent that the RE2 objects are?
If I understood the brief note in the docs on RE2 thread-safety, is it
true that multiple threads can be performing matches on a pre-compiled
RE2 simultaneously?

Thanks!
--edan

Elazar Leibovich

unread,

Sep 16, 2010, 10:02:46 AM9/16/10

to Ephraim Dan, re2-dev

(Please note, I'm not an "official" developer, I'm hacking around and my understanding of the code is not perfect.)

If I understand the code correctly, the Set is just a compiler, which produces the same Prog structure. So I don't see any reason that running a Prog produced by the Set would be different than a Prog produced by Regex.

BTW, please note that your regular expression is not guaranteed to be ran by a regular expression. Even without subexpression, there engine might decided there's a better way.

Russ Cox

unread,

Sep 16, 2010, 10:47:05 PM9/16/10

to Ephraim Dan, re2-dev

> Do I assume that matching a "Set" will only perform a single pass on
> the input for all the regexps, and would presumably be significantly
> faster than running all the regexps serially on the input?

Yes. That's exactly the reason to have it. It can be a big win.

> Is the "Match" method a "FullMatch" or "PartialMatch"? Why doesn't it
> have both like the regexps themselves do?

Match is whatever you specify in the Set constructor:

enum Anchor {
UNANCHORED, // No anchoring
ANCHOR_START, // Anchor at start only
ANCHOR_BOTH, // Anchor at start and end
};

> Are "Set"s thread-safe to the same extent that the RE2 objects are?

Yes.

> If I understood the brief note in the docs on RE2 thread-safety, is it
> true that multiple threads can be performing matches on a pre-compiled
> RE2 simultaneously?

Yes.

Russ

Ephraim Dan

unread,

Sep 20, 2010, 4:09:13 AM9/20/10

to re2-dev

Very cool - so UNANCHORED corresponds to PartialMatch - I see.

It appears that matching with a Set does not provide sub-string info.
Is this something you plan on adding? In the near future? We'd
really like the multi-match capability, but we need substrings.

Also in general regarding substrings - the comments in re2.h say that
this is a big performance hit - how bad? Is it something you plan on
addressing? This is also potentially a big problem for us. Any
guidance you can give here is most helpful.

Also regarding thread-safety - have you run any tests on scaling to
multi-cores? How lock-heavy is the implementation?

Thanks!

Russ Cox

unread,

Sep 22, 2010, 1:59:05 PM9/22/10

to Ephraim Dan, re2-dev

On Mon, Sep 20, 2010 at 4:09 AM, Ephraim Dan <eda...@gmail.com> wrote:
> Very cool - so UNANCHORED corresponds to PartialMatch - I see.
>
> It appears that matching with a Set does not provide sub-string info.
> Is this something you plan on adding? In the near future? We'd
> really like the multi-match capability, but we need substrings.

I don't plan to add it. You'd have to do a second pass to
get substring information, and it's just as efficient to do it
in the calling code as it would be in the RE2::Set.

> Also in general regarding substrings - the comments in re2.h say that
> this is a big performance hit - how bad? Is it something you plan on
> addressing? This is also potentially a big problem for us. Any
> guidance you can give here is most helpful.

The hit depends on the type of regular expression but is
quite fundamental to the problem: it's computationally harder.
The only way to test is to measure on your particular workload.

> Also regarding thread-safety - have you run any tests on scaling to
> multi-cores? How lock-heavy is the implementation?

The implementation runs well on multicore machines at Google.
The places where lock contention was heaviest have been
rewritten to avoid them.

Russ

Reply all

Reply to author

Forward