RE2::Set lets you build a set of regexps and then run over an
input text just once and identify all the regexps that matched.
It does pretty aggressive factoring of common prefixes among
adjacent regexps, so if you have particularly complicated but
often similar regexps, it's worth sorting them before passing
them to the Set to bring regexps with common prefixes together.
http://code.google.com/p/re2/source/browse/re2/set.h
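
A minimal sketch of the sorting step, using only the standard library. The helper names SortedForSet and CommonPrefixLength are made up for this illustration and are not part of RE2. A plain lexicographic sort is assumed to be good enough to bring patterns with the same literal prefix next to each other; the sorted patterns would then be added to the Set one by one, in that order.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the patterns in lexicographic order, so that
// patterns sharing a literal prefix end up adjacent to each other.
std::vector<std::string> SortedForSet(std::vector<std::string> patterns) {
  std::sort(patterns.begin(), patterns.end());
  return patterns;
}

// Hypothetical helper: number of leading bytes two patterns have in common.
// This is a plain byte comparison and knows nothing about regexp syntax.
std::size_t CommonPrefixLength(const std::string& a, const std::string& b) {
  const std::size_t n = std::min(a.size(), b.size());
  std::size_t i = 0;
  while (i < n && a[i] == b[i]) ++i;
  return i;
}
```

Calling CommonPrefixLength on neighbouring entries of the sorted list is a quick way to check whether the sort actually grouped the patterns you expected.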
Russ
Yes. That's exactly the reason to have it. It can be a big win.
> Is the "Match" method a "FullMatch" or "PartialMatch"? Why doesn't it
> have both like the regexps themselves do?
Match is whatever you specify in the Set constructor:
enum Anchor {
  UNANCHORED,    // No anchoring
  ANCHOR_START,  // Anchor at start only
  ANCHOR_BOTH,   // Anchor at start and end
};
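
To make the three modes concrete, here is a rough illustration that uses std::regex as a stand-in, not RE2 itself. The local Anchor enum and the MatchesWithAnchor function are hypothetical names for this sketch only. The assumed correspondence is: unanchored behaves like a search anywhere in the text, anchor-at-start like a search that must begin at the first character, and anchor-at-both like a full match.

```cpp
#include <regex>
#include <string>

// Local to this sketch; mirrors the three anchoring modes listed above.
enum class Anchor { kUnanchored, kAnchorStart, kAnchorBoth };

// Hypothetical helper showing what each anchoring mode means for one pattern.
bool MatchesWithAnchor(const std::string& pattern, const std::string& text,
                       Anchor anchor) {
  const std::regex re(pattern);
  switch (anchor) {
    case Anchor::kUnanchored:
      // The match may start and end anywhere in the text.
      return std::regex_search(text, re);
    case Anchor::kAnchorStart:
      // The match must begin at the first character of the text.
      return std::regex_search(text, re,
                               std::regex_constants::match_continuous);
    case Anchor::kAnchorBoth:
      // The match must cover the whole text.
      return std::regex_match(text, re);
  }
  return false;
}
```

With the pattern "b+", the text "abbc" matches only in the unanchored mode, "bbc" also matches when anchored at the start, and "bb" matches in all three.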
> Are "Set"s thread-safe to the same extent that the RE2 objects are?
Yes.
> If I understood the brief note in the docs on RE2 thread-safety, is it
> true that multiple threads can be performing matches on a pre-compiled
> RE2 simultaneously?
Yes.
Russ
I don't plan to add it. You'd have to do a second pass to
get substring information, and it's just as efficient to do it
in the calling code as it would be in the RE2::Set.
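
A sketch of what that second pass might look like in the calling code. It again uses std::regex as a stand-in for the individual regexps, and SecondPass is a hypothetical name. The assumption is that the first pass has already produced the indices of the patterns that matched, so only those patterns are re-run to pull out a capture group.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: for each pattern index reported by the first pass,
// re-run that single pattern over the text and record its first capture group.
std::vector<std::pair<int, std::string>> SecondPass(
    const std::vector<std::string>& patterns,
    const std::vector<int>& matched,  // indices from the first pass
    const std::string& text) {
  std::vector<std::pair<int, std::string>> out;
  for (int i : matched) {
    const std::regex re(patterns[i]);
    std::smatch m;
    // Only patterns with at least one capture group contribute a substring.
    if (std::regex_search(text, m, re) && m.size() > 1) {
      out.emplace_back(i, m[1].str());
    }
  }
  return out;
}
```

In real code the regexps would be compiled once up front instead of inside the loop; they are built on the fly here only to keep the sketch short.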
> Also in general regarding substrings - the comments in re2.h say that
> this is a big performance hit - how bad? Is it something you plan on
> addressing? This is also potentially a big problem for us. Any
> guidance you can give here is most helpful.
The hit depends on the type of regular expression but is
quite fundamental to the problem: it's computationally harder.
The only way to test is to measure on your particular workload.
> Also regarding thread-safety - have you run any tests on scaling to
> multi-cores? How lock-heavy is the implementation?
The implementation runs well on multicore machines at Google.
The places where lock contention was heaviest have been
rewritten to avoid it.
Russ