One of the reasons I want to use RE2 is to solve the potential
> 1) The RE2 engine should fall back to NFA if the DFA has failed. So it
> should be able to handle whatever you throw at him (with limitation of space
> and time - certain regexps have exponential running time).
exponential running time that we get with pcre. Is there no way to
guarantee avoiding this with RE2 as well? Are there other ways to
mitigate?
I don't quite follow. But I am getting these errors at match time,
> 2) I'm not sure the cost of computing the size of the DFA of a certain
> regular expression, is cheaper than the cost of computing the actual DFA.
not compile time. So I want to know if there is any way to know
without actually performing the match, how much memory the DFA might
take?
It is a set of about 1000 regular expressions.
> 3) What was the regular expression you were trying to use?
That's probably fine. I should make the warning less aggressive.
> Is there a way to know how much memory is required for a given size of
> input? If one doesn't know up front necessarily how big the inputs
> are, how does one make sure the engine can handle what I throw at it?
It's only a cache, so it can handle it as long as it can operate
within the cache. The print happens when the cache fills twice
during the same input. It should really also check that the
distance between the two cache fills is approximately equal
to the number of states, meaning the cache hit ratio is very low.
There is a separate if statement elsewhere in dfa.cc that says
// After we reset the cache, we hold cache_mutex exclusively,
// so if resetp != NULL, it means we filled the DFA state
// cache with this search alone (without any other threads).
// Benchmarks show that doing a state computation on every
// byte runs at about 0.2 MB/s, while the NFA (nfa.cc) can do the
// same at about 2 MB/s. Unless we're processing an average
// of 10 bytes per state computation, fail so that RE2 can
// fall back to the NFA.
if (FLAGS_re2_dfa_bail_when_slow && resetp != NULL &&
(p - resetp) < 10*state_cache_.size()) {
params->failed = true;
return false;
}
Something more like that logic should probably control the
cache printing message too. If you add a print after that
if statement, like
LOG(ERROR) << "Reset " << (p-resetp) << " " << state_cache_.size();
then you can compare the two numbers and see for yourself
whether the cache is big enough to be effective.
Russ