DFA memory cache errors and out of memory

701 views
Skip to first unread message

Ephraim Dan

unread,
Oct 11, 2010, 9:01:05 AM10/11/10
to re2-dev
What are the source of these errors?

re2/dfa.cc:1122: DFA memory cache could be too small: only room for
931 states.
re2/set.cc:104: RE2::Set::Match: DFA ran out of cache space

I tried increasing the memory with Options::set_max_mem but I get the
same thing, just with a higher number of states on the first error
message. This happens when I am running on a fairly large, but not
huge, input buffer that is 300K. I don't have this on smaller inputs.

Is there a way to know how much memory is required for a given size of
input? If one doesn't know up front necessarily how big the inputs
are, how does one make sure the engine can handle what I throw at it?
What is the end result of the message - did the match fail, or was it
handled gracefully in some way? Are there any cases at match time
that a match can fail, or are we guaranteed "correct" successful
matching despite these errors?

I also get these sometimes:

re2/dfa.cc:447: DFA out of memory: prog size 25089 mem 2695652

I also assume I can up the max_mem value to get rid of them, but I
guess I have the same questions as above - how can I know ahead of
time how much memory I need and what is the end result at match time?

Thanks

Elazar Leibovich

unread,
Oct 11, 2010, 9:54:41 AM10/11/10
to Ephraim Dan, re2-dev
I'm not a developer, so please take my word with caution, I might be wrong.

1) The RE2 engine should fall back to NFA if the DFA has failed. So it should be able to handle whatever you throw at him (with limitation of space and time - certain regexps have exponential running time).

2) I'm not sure the cost of computing the size of the DFA of a certain regular expression, is cheaper than the cost of computing the actual DFA.

3) What was the regular expression you were trying to use?

Ephraim Dan

unread,
Oct 11, 2010, 10:22:13 AM10/11/10
to re2-dev

> 1) The RE2 engine should fall back to NFA if the DFA has failed. So it
> should be able to handle whatever you throw at him (with limitation of space
> and time - certain regexps have exponential running time).

One of the reasons I want to use RE2 is to solve the potential
exponential running time that we get with pcre. Is there no way to
guarantee avoiding this with RE2 as well? Are there other ways to
mitigate?

> 2) I'm not sure the cost of computing the size of the DFA of a certain
> regular expression, is cheaper than the cost of computing the actual DFA.

I don't quite follow. But I am getting these errors at match time,
not compile time. So I want to know if there is any way to know
without actually performing the match, how much memory the DFA might
take?

> 3) What was the regular expression you were trying to use?

It is a set of about 1000 regular expressions.

Elazar Leibovich

unread,
Oct 11, 2010, 11:17:43 AM10/11/10
to Ephraim Dan, re2-dev
On Mon, Oct 11, 2010 at 4:22 PM, Ephraim Dan <eda...@gmail.com> wrote:

> 1) The RE2 engine should fall back to NFA if the DFA has failed. So it
> should be able to handle whatever you throw at him (with limitation of space
> and time - certain regexps have exponential running time).

One of the reasons I want to use RE2 is to solve the potential
exponential running time that we get with pcre.  Is there no way to
guarantee avoiding this with RE2 as well?  Are there other ways to
mitigate?

Recognizing whether or not a string matches an RE can be an hard problem. For instance, one can use it to find out whether a number is a prime [1]. It's not always an easy thing to do. Converting the RE to a DFA is just a trade off between execution time and space, instead of (potentially) using exponential amount of computing time, you'll (potentially) use exponential amount of memory.

Moreover, I think that finding out what's the exact size of the DFA that matches a certain RE, is not an easy problem too! I'm not sure you can find out what's the expected size of the DFA without actually calculating it.

What you can do, if you want to be absolutely certain you can match this string with DFA, is to build the whole DFA, and if it fits your memory budget - you can be sure that you'll be able to match it in linear time. The function Prog::BuildEntireDFA does just that.



> 2) I'm not sure the cost of computing the size of the DFA of a certain
> regular expression, is cheaper than the cost of computing the actual DFA.

I don't quite follow.  But I am getting these errors at match time,
not compile time.  So I want to know if there is any way to know
without actually performing the match, how much memory the DFA might
take?

Compiling the RE does not build the DFA. The DFA is built in run time. This is because you don't always have to build the entire DFA in order to match a string. Compiling the DFA merely transform the RE into an NFA represented by set of assembly commands, and runs the OnePass optimization (if applicable) on it.
 

> 3) What was the regular expression you were trying to use?

It is a set of about 1000 regular expressions.

If you can find out the RE which takes a really big amount of memory, I'll be glad to see it. Is it something along the line of (x+x+)y+? or (aaa)*|(aa)*|(aaaaaaaaaaa)*|(aaaaaaaaaaaaaaaaaaa)*|(aaaaaaa)*|(aaaaaaaaaaaaaaaaa)* (see explanation in benchmarking section of [2]) or (a|b)*a(a|b){m} (see [3])
I love those little examples.

 

Russ Cox

unread,
Oct 11, 2010, 5:06:24 PM10/11/10
to Ephraim Dan, re2-dev
> I tried increasing the memory with Options::set_max_mem but I get the
> same thing, just with a higher number of states on the first error
> message.  This happens when I am running on a fairly large, but not
> huge, input buffer that is 300K.  I don't have this on smaller inputs.

That's probably fine. I should make the warning less aggressive.

> Is there a way to know how much memory is required for a given size of
> input?  If one doesn't know up front necessarily how big the inputs
> are, how does one make sure the engine can handle what I throw at it?

It's only a cache, so it can handle it as long as it can operate
within the cache. The print happens when the cache fills twice
during the same input. It should really also check that the
distance between the two cache fills is approximately equal
to the number of states, meaning the cache hit ratio is very low.

There is a separate if statement elsewhere in dfa.cc that says

// After we reset the cache, we hold cache_mutex exclusively,
// so if resetp != NULL, it means we filled the DFA state
// cache with this search alone (without any other threads).
// Benchmarks show that doing a state computation on every
// byte runs at about 0.2 MB/s, while the NFA (nfa.cc) can do the
// same at about 2 MB/s. Unless we're processing an average
// of 10 bytes per state computation, fail so that RE2 can
// fall back to the NFA.
if (FLAGS_re2_dfa_bail_when_slow && resetp != NULL &&
(p - resetp) < 10*state_cache_.size()) {
params->failed = true;
return false;
}

Something more like that logic should probably control the
cache printing message too. If you add a print after that
if statement, like

LOG(ERROR) << "Reset " << (p-resetp) << " " << state_cache_.size();

then you can compare the two numbers and see for yourself
whether the cache is big enough to be effective.

Russ

Reply all
Reply to author
Forward
0 new messages