Possible to determine if one regex is a "subset" of another?

1,227 views
Skip to first unread message

Andy

unread,
Mar 4, 2009, 10:42:38 AM3/4/09
to Regex
Hello,
I am trying to come up with a way, given two regular expressions A and
B as input, to answer the following question: Does B match every
string that A matches?
Put another way, suppose L(A) is the language of expression A (ie the
set of all strings that A matches, which may be infinitely many
strings). Then I want to know whether L(A) is a subset of L(B). Is
there an algorithm out there to determine this?
Thanks,
Andy

Ralph Boland

unread,
Mar 11, 2009, 12:54:52 AM3/11/09
to Regex
The standard way is to construct minimum state DFAs from the regular
expressions
You can then test if one DFA is a labelled subgraph of the other.
There are easier ways though and I use one in a tool I am constructing
for constructing
DFAs from regular expressions.

Ralph Boland

Andy

unread,
Mar 11, 2009, 3:39:13 PM3/11/09
to Regex
Ralph,
Thanks for the response. I'd been thinking that working with minimal
DFAs was the way to go, but hadn't gotten any farther than that.
But, if L(A) is a subset of L(B), does that necessarily mean DFA(A) a
subgraph of DFA(B)? For example, suppose A=/bar/ and B=/.*/. Clearly
L(A) = {bar} is a subset of L(B) = all strings. If I'm doing this
right, DFA(A) has 4 vertices (or states) and DFA(B) only needs one, so
DFA(A) can't be a subgraph of DFA(B). Or am I missing something about
the definition of "labeled subgraph" that allows for this?
Andy

Ralph Boland

unread,
Mar 13, 2009, 11:32:46 PM3/13/09
to Regex

Yes, you're right as far as I can see.
The only way that I can see now to test for subset is the method I use
for testing for equivalence.
I will describe the algorithm starting from two NFAs.

Let B and C be two NFAs (possibly DFAs).
We want to test if L(B) is a subset of L(C) and conversely.
Let's construct the DFA of B | C.
By using the or operator '|' between NFAs I mean to use the method
used for the '|' operator
in Thomson's construction. That is: add a new start state and a new
end state and join
the new start state to each of the start states of B and C by
epsilon transitions;
then join the end states of B and C to the new end state by
epsilon transitions.

Let D be the DFA constructed from B | C using the subset
construction algorithm.
Let d be any state of D. As a result of the subset construction
algorithm,
d has associated with it a set of NFA states s(d) that distinguish
d from any other state of D.
s(d) is constructed from the states of B and C, such that
s(d) = s(d,B) disjointUnion s(d,C)
where s(d,B) is the subset of the states of B that are states of s
(d)
and s(d,C) similarly is the subset of states of C that are states
of s(d).

Let t be a string such that, after reading t, D is in state d.
Assume s(d,B) is empty. Then B will reach an error on reading t
before reaching the end of t.
But then s(d,C) is not empty since s(d) is not empty.
Thus, C will NOT reach an error on reading t before reaching the
end of t.
As a result, L(C) is not a subset of L(B) because there are strings
accepted by C that are not
accepted by B; some (possibly all) of those strings begin with t.

More generally, L(B) is a superset of L(C) if for every state e
of D, s(e,B) is not empty.
Similarly, L(C) is a superset of L(B) if for every state e
of D, s(e,C) is not empty.
If L(B) is a superset of L(C) and conversely, then B and C are
equivalent.

Thus the algorithm is:
1) Construct D from B | C.
2) Return "B and C equivalent" if for every state e of D, s
(e,B) and s(e,C) are not empty.
3) Return "B is a subset of C" if for every state e of D, s
(e,C) is not empty.
4) Return "C is a subset of B" if for every state e of D, s
(e,B) is not emtpy.
5) Return "B is not a subset of C and conversely".
6) Send Ralph Boland $1000.

OK; the last step is optional.

Notes:

1) It is not always necessary to construct all of D. As each state
of D is constructed
by subset construction the subset or equivalence test can be applied.
The algorithm can
be terminated immediately if a test failure occurs. In my case the
finite automata are
almost always equivalent so usually all states get tested.

2) In principle, this algorithm could be more expensive for testing
equivalence than constructing
miniminal state DFAs and testing for isomorphism, especially if B
and C are DFAs.
But in practice, I find the algorithm works quite well. I have never
implemented the isomorphism
approach so I can't provide comparisons. My algorithm is certainly
simpler though; especially if
you have all the tools in place to construct DFAs from regular
expressions but not the DFA
minimization algorithm as was the situation I was in at the time.

3) I am sure that I am not the first to come up with this algorithm
but I have no references.

4) If there is a better way, I'd love to hear it.

5) I never meant the above to be a formal proof but a brief
explanation of the algorithm so
if you want more rigor you're on your own though you could post
your result here.

6) The code I have referred to is Smalltalk code which I will
eventually release as Open source.

Ralph Boland

Andy

unread,
Mar 16, 2009, 6:11:53 PM3/16/09
to Regex
On Mar 13, 11:32 pm, Ralph Boland <rpbol...@gmail.com> wrote:
> Let t be a string such that, after reading t, D is in state d.
> Assume s(d,B) is empty. Then B will reach an error on reading t
> before reaching the end of t.
> But then s(d,C) is not empty since s(d) is not empty.
> Thus, C will NOT reach an error on reading t before reaching the
> end of t.

Not sure I agree with this. Suppose s(d,C) is non-empty, and let w be
a prefix of t such that after reading w, D is in state e. It is
possible that S(e,C) is empty, and C would reach an error while
reading w. But, D will not, because s(e,B) is non-empty, so D can
reach state d as you describe above.

It seems that in order to know that C will read t without error, you
need a stronger condition: for every state x of D that is visited
while D reads t, s(x,C) must be non-empty. But I think you can get
that result by induction on the length of t without too much trouble,
and of course the actual algorithm requires every state in D to have s
(x,C) non-empty, so we're covered there.

Now you've got me interested enough to try and sketch out a
correctness proof for this, I'll post again if I come up with one :)

Andy Blank

Ralph Boland

unread,
Mar 18, 2009, 1:27:05 AM3/18/09
to Regex


On Mar 16, 4:11 pm, Andy <blankma...@gmail.com> wrote:
> On Mar 13, 11:32 pm, Ralph Boland <rpbol...@gmail.com> wrote:
>
> > Let  t  be a string such that, after reading  t,  D  is in state  d.
> > Assume  s(d,B)  is empty. Then  B  will reach an error on reading  t
> > before reaching the end of  t.
> > But then  s(d,C) is not empty since s(d) is not empty.
> > Thus, C  will NOT reach an error on reading  t  before reaching the
> > end of  t.
>
> Not sure I agree with this.  Suppose s(d,C) is non-empty, and let w be
> a prefix of t such that after reading w, D is in state e.  It is
> possible that S(e,C) is empty, and C would reach an error while
> reading w.  But, D will not, because s(e,B) is non-empty, so D can
> reach state d as you describe above.

Can you give an example of this.
I think you will find that you cannot construct one.
Admittedly, the onus should be on me to prove my algorithm correct,
not claim it is correct because no one can find a counter-example.
But I wasn't trying to write a proof, just a programmer acceptable
argument.
(I know, this statement is probably getting me into a lot of trouble).

Let me provide this argument:
If s(e,C) is empty then s(e,B) is not empty and so B can read up
to string w
without error but C cannot. So now, in my argument, swap B and
C and replace
t by w. I am now defending my argument by effectively shortening
the string t.
You can of course repeat your argument with an even shorter string,
say v.
We can repeat this process again and again until, after at most |t|
iterations,
you run out of string. Dare I say, Q.E.D.
Another way of putting it is that I should have added to my
argument the restriction that t be the shortest possible example.
Since your w would be an even shorter example it cannot be
considered.

>
> It seems that in order to know that C will read t without error, you
> need a stronger condition: for every state x of D that is visited
> while D reads t, s(x,C) must be non-empty.  But I think you can get
> that result by induction on the length of t without too much trouble,

I think your argument is pretty much the same as mine as to why your
proposed counter-example cannot really happen.


> and of course the actual algorithm requires every state in D to have s
> (x,C) non-empty, so we're covered there.
>
> Now you've got me interested enough to try and sketch out a
> correctness proof for this, I'll post again if I come up with one :)
>

Keep in mind that my algorithm surely exists in the literature
somewhere,
likely with a proper correctness proof and complexity analysis to
boot.

> Andy Blank

Ralph Boland

unread,
Mar 21, 2009, 2:45:27 AM3/21/09
to Regex

> ...
> More generally,  L(B)  is a superset of  L(C)  if for every state  e
> of  D,  s(e,B)  is not empty.
> Similarly,       L(C)  is a superset of  L(B)  if for every state  e
> of  D,  s(e,C)  is not empty.
> If  L(B)  is a superset of  L(C)  and conversely, then  B  and  C  are
> equivalent.
>
> Thus the algorithm is:
> 1) Construct  D  from  B | C.
> 2) Return "B  and  C  equivalent" if for every state  e of  D,  s
> (e,B)  and  s(e,C)  are not empty.
> 3) Return "L(B)  is a subset of  L(C)" if for every state  e  of  D,  s
> (e,C)  is not empty.
> 4) Return "L(C)  is a subset of  L(B)" if for every state  e  of  D,  s
> (e,B)  is not emtpy.
> 5) Return "L(B)  is not a subset of  LO(C)  and conversely".
> 6) Send Ralph Boland $1000.
>
> OK; the last step is optional.
>

I missed an important point here: final states.
For L(B) to be a subset of L(C) it must be that for every state e
of D,
if s(e,B) contains a final state of B then s(e,C) must contain a
final state of C.

If this additional condition does not hold then
L(B) is not a subset of L(C) but instead, for any string t
accepted by D, t is of the form t = uv where ur is accepted by
B
for some string r.
This weaker condition is not useful because in general u can be the
empty string. However, additional tests can be added so that the
condition becomes t is of the from t = uv where u is accepted by
B.
This I believe is useful. I have not tested this in my code though as
I don't need it.

Ralph Boland


P.S. Boy, the money is really piling in. ;-)
Reply all
Reply to author
Forward
0 new messages