On Mon, Nov 16, 2009 at 1:47 PM, Russ Cox <r...@golang.org> wrote:
[...]
>> Thinking about it, one could combine strategies by expanding out the
>> DFA lazily at run-time with a memoization step. This should be O(m)
>> compilation and O(n) run-time but memory usage would go up to
>> O(min(m!, n)).
>
> You still have O(m*n) (but more commonly only O(n)) runtime.
Oops, you're obviously right. For a fixed regular expression and
arbitrarily long strings, runtime is O(n). But when the regular
expression and the string are both long, runtime is O(n*m), because you
don't necessarily benefit from memoizing, on the fly, the ordered sets
of possible NFA states into single DFA states: every cache miss costs an
NFA step over up to m states, and nothing stops you from missing on most
of the n input bytes.
> The space is at least O(m) to hold the regexp but then only
> whatever you choose as your cache size. The search only
> needs one DFA state at a time (two during a transition) so the
> lower bound on the cache size is very small. There's certainly
> no need to remember every state during the run, which is
> what the n in O(min(m!, n)) seems to imply.
The idea of having a fixed cache size is a twist I hadn't considered.
Here is how I got O(min(m!, n)). I was thinking in terms of patching in
new DFA states, each of which maps back to an ordered set of NFA states
you could be in at this point in the string. In the long run you could
theoretically expand the whole DFA out incrementally, which means you're
bounded above by the O(m!) possible ordered subsets of the m NFA states.
But you will only discover the DFA states you actually encounter while
trying to match the string, at most one new state per input byte, which
gives you an upper limit of O(n) states expanded. Satisfying both limits
gives O(min(m!, n)).
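To make the memoization step concrete, here is a minimal sketch in Go of
lazy DFA expansion over a toy hand-built NFA. The nfa type, stepNFA, and
key are all invented for illustration, not taken from any real engine;
the point is just the shape of the idea: each cache miss runs one NFA
step and records the resulting ordered state set as a new DFA state, so
the cache only grows with states actually encountered.

package main

import (
    "fmt"
    "strings"
)

// Toy NFA: transitions indexed by state, then by input byte.
type nfa struct {
    trans  map[int]map[byte][]int
    start  []int
    accept map[int]bool
}

// A lazily discovered DFA state: an ordered set of NFA states plus
// transitions filled in on demand.
type dfaState struct {
    states []int
    next   map[byte]*dfaState
}

// key canonicalizes an ordered NFA state set so it can index the cache.
func key(states []int) string {
    var b strings.Builder
    for _, s := range states {
        fmt.Fprintf(&b, "%d,", s)
    }
    return b.String()
}

// stepNFA advances every state in the ordered set on input c,
// preserving order and dropping duplicates.
func stepNFA(n *nfa, states []int, c byte) []int {
    var out []int
    seen := map[int]bool{}
    for _, s := range states {
        for _, t := range n.trans[s][c] {
            if !seen[t] {
                seen[t] = true
                out = append(out, t)
            }
        }
    }
    return out
}

func match(n *nfa, input string) bool {
    cache := map[string]*dfaState{} // grows to at most min(O(m!), O(n)) entries
    start := &dfaState{states: n.start, next: map[byte]*dfaState{}}
    cache[key(n.start)] = start

    cur := start
    for i := 0; i < len(input); i++ {
        c := input[i]
        next, ok := cur.next[c]
        if !ok {
            // Cache miss: run one NFA step, then memoize the result so
            // this transition is computed at most once.
            set := stepNFA(n, cur.states, c)
            k := key(set)
            if next, ok = cache[k]; !ok {
                next = &dfaState{states: set, next: map[byte]*dfaState{}}
                cache[k] = next
            }
            cur.next[c] = next
        }
        cur = next
    }
    for _, s := range cur.states {
        if n.accept[s] {
            return true
        }
    }
    return false
}

func main() {
    // NFA for "ab*": 0 --a--> 1, 1 --b--> 1, state 1 accepting.
    n := &nfa{
        trans:  map[int]map[byte][]int{0: {'a': {1}}, 1: {'b': {1}}},
        start:  []int{0},
        accept: map[int]bool{1: true},
    }
    fmt.Println(match(n, "abbb")) // true
    fmt.Println(match(n, "ba"))   // false
}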
Limiting the cache size seems like it would lead to different behaviors
depending on how it is done. For instance, one approach is to expand out
only some DFA states, then work out the rest at run time using hybrid
states that combine cached DFA states with raw NFA state sets (a sketch
of this follows below). This would lessen how often you head off toward
the worst-case O(n*m) behavior. An alternate approach would be to expand
out some DFA states, then switch to backtracking once the cache got too
big, on the theory that you'll usually stay within the handful of most
common states. This gives you exponential worst-case performance, but
removes many of the common cases where it could happen.
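Here is a sketch of the first variant, reusing the nfa, dfaState, key,
and stepNFA helpers from the sketch above. The cache cap and the
carry-the-raw-set-forward policy are my own assumptions, not how any
real engine does it; the backtracking variant would instead hand off to
a backtracking matcher at the point where the uncached state is created.

// matchBounded is match with the memoization cache capped at maxStates.
// Past the cap it stops creating cached DFA states and carries the raw
// NFA state set forward uncached, degrading toward O(n*m) time instead
// of growing memory without bound.
func matchBounded(n *nfa, input string, maxStates int) bool {
    cache := map[string]*dfaState{}
    start := &dfaState{states: n.start, next: map[byte]*dfaState{}}
    cache[key(n.start)] = start

    cur := start
    for i := 0; i < len(input); i++ {
        c := input[i]
        if next, ok := cur.next[c]; ok {
            cur = next
            continue
        }
        set := stepNFA(n, cur.states, c)
        next, ok := cache[key(set)]
        if !ok {
            if len(cache) >= maxStates {
                // Cache full: a hybrid state, i.e. an uncached ordered
                // NFA set. Its transitions may get recomputed later.
                cur = &dfaState{states: set, next: map[byte]*dfaState{}}
                continue
            }
            next = &dfaState{states: set, next: map[byte]*dfaState{}}
            cache[key(set)] = next
        }
        cur.next[c] = next
        cur = next
    }
    for _, s := range cur.states {
        if n.accept[s] {
            return true
        }
    }
    return false
}

With maxStates set large this behaves exactly like match; with maxStates
set to 1 every byte recomputes an NFA step, which is the O(n*m) floor.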
The latter approach may seem to defeat the whole goal of a good
algorithm. But that kind of "fall back to backtracking" strategy can be
used to match most regular expressions with guaranteed performance,
while still letting you match backreferences.
Cheers,
Ben