http://www.pearsonhighered.com/educator/academic/product/0,1144,0321462254,00.html
I looked through Section 3.4 Algebraic laws for regular expressions of
Introduction to Automata Theory, Languages, and Computation, by
Hopcroft, Motwani and Ullman (2nd edition). But I don't see the set
difference between two regular sets. Could somebody show me how to
compute the regular expression for the difference between two regular
sets, given the two original regular expressions?
[It is my strong recollection that the difference of two REs is not
necessarily a RE, but I can't find the reference for it,
either. -John]
> Could somebody show me how to
> compute the regular expression for the difference between two regular
> sets, given the two original regular expressions?
Yes, regular languages are closed under the difference operation.
I haven't worked with them a lot lately, but I learned the following
way to derive the difference (intersection, union, ...) of two regexes
in a mechanic (if complex) way:
* Translate the regular expressions A, B into deterministic finite
automata
fA, fB. There is an algorithm for this.
* The finite automaton fC contains the Cartesian product of the states
in fA
and fB and contains the transition x:(a*b, c*d) if fA contains x:(a,
c) and
fB contains x:(b, d).
* In fC, a*b is the initial state iff a is initial in fA and b is
initial in
fB.
* In fC, state a*b is a final state iff:
- a is final in fA and b is not final in fB: difference (!)
- a is final in fA and b is final in fB: intersection
- a is final in fA or b is final in fB: union
- ...
* Remove unreachable states from fC.
* Translate deterministic finite automaton fC into a regular
expression C.
There is an algorithm for this.
There may be a faster way.
Good luck,
--
Michiel Helvensteijn
See my answer below.
> Suppose that I have two regular sets A and B that are generated by two
> regular expressions. I'm wondering if there is always a regular
> expression that can generated the difference between the two sets,
> i.e., A-B. If there is such regular expression, is there a mechanical
> way to generated such regular expression.
Yes, the difference between two regular sets is a regular set. Here is a
mechanical way to produce it:
1- Transform each of regular expressions A and B into automatons AUTA
and AUTB.
2- Create the following NFA:
StartState
|
+--(epsilon)--AUTA
|
\--(epsilon)--AUTB
3- Transform the NFA into a DFA, but determine its accepting state as
follows: A state is an accepting state if and only if it corresponds to
accepting state of AUTA and not to an accepting state of AUTB.
4- You can then use well-known DFA-to-regular-expression transformations
to get a (possibly ugly) equivalent regular expression.
It should be fairly obvious that this is theoretically sound. We are
simply computing a DFA that walks A and B in parallel, and that accepts
all elements of A that are not elements of B.
The beta version of SableCC 4.0 ( http://sablecc.org/ ) implements this
difference (called Except) operator for lexer specifications. You would
express it as follows:
Language my_test_language;
Lexer
// regular definitions
a = Any+;
b = 'Etienne';
a_difference_b = a Except b;
Token
a_difference_b; // list of tokens to be recognized by lexer
Etienne
--
Etienne M. Gagnon, Ph.D.
SableCC: http://sablecc.org
Sure. RL's are closed under complement (flip the accepting and non-
accepting states of the corresponding DFA) and intersection (build the
cross product DFA and make a state a x b accepting iff both a and b
are accepting in the original machines). Now you know how to build
the DFA for intersection(A, not(B)), which is the same as A-B. To go
back and forth from DFA to regex, use the constructions in any proof
of Kleene's theorem. John may be remembering that the difference of
CFLs is not necessarily context free.
[You're right. "Never mind." -John]
You can construct a set difference from intersection and
complementation: R \ S = R & ~S.
http://wiki.tcl.tk/461 explains how to build DFAs for the intersection/
complement of regular languages.
Then you can reverse engineer an RE from the DFA if need be.
--
Marcin
> Suppose that I have two regular sets A and B that are generated by two
> regular expressions. I'm wondering if there is always a regular
> expression that can generated the difference between the two sets,
> i.e., A-B. If there is such regular expression, is there a mechanical
> way to generated such regular expression.
>
> http://www.pearsonhighered.com/educator/academic/product/0,1144,0321462254,00.html
>
> I looked through Section 3.4 Algebraic laws for regular expressions of
> Introduction to Automata Theory, Languages, and Computation, by
> Hopcroft, Motwani and Ullman (2nd edition). But I don't see the set
> difference between two regular sets. Could somebody show me how to
> compute the regular expression for the difference between two regular
> sets, given the two original regular expressions?
I don't have the Hopcroft, Motwani, and Ullman book in front of me,
but from the table of contents, it looks like you are in the wrong
section. You want section 4.2.1 (Closure of Regular Languages uUnder
Boolean Operations) together with section 3.2 (Finite Automata and
Regular Expressions) and section 2.3.5 (Equivalence of Deterministic
and Nondeterministic Finite Automata).
Section 4.2.1 would be enough to answer in the affirmative your
question of existence; the difference of two regular sets is itself
regular. However, I would bet it shows this using a construction on
automata rather than regular expressions. Thus, to answer your
further question ("a mechanical way to generate such regular
expression"), you would need the techniques in section 3.2 for
converting back and forth between regular expressions and automata, as
well (probably) as the technique in section 2.3.5 for converting from
a nondeterministic automaton to a deterministic one. If I am guessing
correctly the contents of this book, a sketch of the overall algorithm
would look like this:
Convert the input languages A and B from REs to NFAs (section 3.2.3)
Construct the NFA for the set difference (section 4.2.1)
Convert that NFA into a DFA (section 2.3.5)
Convert the DFA into an RE (sections 3.2.1-3.2.2)
Regarding the core of the matter (constructing the set difference)
this may well not be explicit in section 4.2.1. Typical textbooks
instead show the constructions for intersection and complement. But
A-B is the same as A intersected with the complement of B.
The following paper, which is actually not a paper but a set of solutions
to a midterm examination, has some important clues.
It gives two algoirthms which are defined over NFA's: one for
computing the complement of an NFA (~A), and one for computing the
intersection of two NFA's, A&B.
The difference A-B can be given by this equivalence:
A-B <=> A&(~B)
I.e. applying both algorithms. In fact this is one of the questions
in this midterm:
http://cs164fa09.pbworks.com/f/midterm1_solutions.pdf
Actually, by de Morgan's we should be be able to eliminate &, right?
A&(~B) <=> ~(~A|~(~B)) <=> ~(~A|B)
So it looks like all we need is the complement to do difference.
Constructing the complement over NFA's is far simpler
than going from NFA -> DFA and back.
Complement is simple, in particular: the complemented NFA inherits all
of states and transitions of the original NFA. The meaning of the
states is inverted: non-accepting states in the original become
acceptance states, and vice versa. A new failure state is added which
is an acceptance state (due to the inverted sense of the
complement). Extra transitions are added such that for any state in
the original NFA which has no transition on some input character C, a
transition is added for that state and character which takes it to
this failure state. This rule is applied to the failure state itself,
which self-transitions on all inputs, serving as a ``honeypot'' for
arbitrarily long suffixes. I.e. every explicit failure in the
original NFA (no transition on some character) is routed to this
failure state which has accept semantics under the complement, and
whenever the original NFA reaches an accepting state, in the
complement NFA, this event corresponds to reaching a non-accepting
state. Thus success becomes failure and vice versa.
This still leaves you with the problem of reconstructing a regular
expression from an NFA.
A workable approach for doing the complement, difference, etc,
directly over the regular expression may be to exploit derivatives
somehow.
The derivative of a language L with respect so some string u, is the
set of suffixes of all strings in L which have u as a prefix, with the
u prefix removed. E.g if the languages is { abc, def, deg }, its
derivative with respect to de is { f, g }.
In consideration of a language being generated by a regex, the derivative
concept can be applied to the regex. For instance the derivative of
the regex abb* with respect to ab is the regex b* (which can derive the
empty string). The derivative can sometimes be the empty regex e,
or a special null value which means ``empty set'' or ``matches nothing''.
E.g. the derivative of the regex a* with respect to the string b is
this null value, which may be notated 0. We can write this as
d(b,a*) = 0.
Derivatives can be used to match regular expressions, as an alternative to NFA
construction, and quite readily support operations like complement and
intersection. They can be translated directly to DFA.
It looks like there should be a way to construct a complement, difference or
intersection using derivatives somehow.
Useful paper on derivatives: _Regular Expression Derivatives
Re-Examined_, Owens, Reppy, Turon:
http://www.ccs.neu.edu/home/turon/re-deriv.pdf (and its references).
Using pencil and paper, I think I found the start of a workable
algorithm for computing the compelment of a regex using derivatives,
directly. A similar approach should work for intersections.
Using a four letter {a, b, c, d} alphabet, I computed the following
example complement:
complement(a(abc)*d) -> (|[bcd]|a(abc)*[bc]|a(abc)*d[abcd]+|
aab[abd][abcd]*|aa[acd][abcd]*)
On informal inspection this seems right (do you see any mistakes).
I.e. an empty string mismatches, since the regex derives only nonempty
strings, so we include it. The the one-letter strings b, c and d do
not match, because the regex has a null derivative with respect to
these. An a followed by by zero or more repetitions of "abc",
followed by something other than an a or d (i.e. [bc]) does not match
and so is a branch of of the complement. And so forth.
The ``pre-algorithm'', loosely speaking, is to consider the space of
possible inputs, letter by letter. For each letter, compute the
derivative of the regex with respect to that letter. For all letters
deriving a 0 (null), you can construct a set; this is the set matched
by the complement regex at that character position: add this as a
disjunction to the complement regex, remembering that in the branch
you must match the prior characters leading up to that position). The
* operator is handled specially, along these lines. When the regex is
of the form (r1)*.r2 (I.e. a kleene followed by more regex material)
the complement should match zero or more r1, and then a character
which may not begin r1 or r2. Another branch matches (r1*), and then
continues the process. Of course, cases also have to be included for
mismatching /into/ r1, character by character. I don't have all this
worked out in detail, but just playing with pencil and paper
derivations, the concept seems promising.
In the above example, the rule gives us the two branches a(abc)*[bc],
and a(abc)*d[abcd]. In the first one, the complement regex matches
leading a, the string abc zero or more times, but then a "bc" (letters
which do not begin another "abc" repetition, nor match the "d" suffix
which follows). In the second one, the complement matches a leading a
with zero or more of the "abc" repeat, then a d which branches to the
next part of the pattern, but then an extra letter (wildcard of the
whole alphabet) which spoils the match. How that last one works is
that the derivative of d with respect to a, b, and c derives 0, so we
match a d (the complement of those three). Then we must match the
complement of the epsilon regex which is any letter, one or more
times. I.e. the complement regex must match all strings which are
prefixes of the regular regex, but have incompatible suffixes: d
followed by junk.
The mismatches through the (abc) part of *(abc) are reflected in
the branches aa[acd][abcd]* and aab[abd][abcd]*. I.e. the complement matches
the leading a, and then "a" (possible start of "abc") followed by something
other than b, and also the leading a, then "ab", followed by something other
than c, failing to complete the "abc". Both branches have to allow arbitrary
amount of junk, all of which mismatches the original regex and so matches
the complement.
Without a question, character class syntax with copmlements is essential, since
for an actual character set, the complement spaces are huge; e.g. we want to
encode [abcd] simply as . ; the equivalent of literally writing [abcd] even in
ASCII will kill us, not to mention Unicode.
In our above example, we really want to write it like this:
complement(a(abc)*d) -> (|[^a]|a(abc)*[^ad]|a(abc)*d.+|aab[^c].*|aa[^b].*)
Now it works for an alphabet which includes symbols other than just a, b, c
and d.
Oops; on second look, that gives an algorithm for complementing
a deterministic automaton!
> Complement is simple ...
*wipe egg from face*. Nope. Sorry about that mixup. :)
....
> Using a four letter {a, b, c, d} alphabet, I computed the following
> example complement:
>
> complement(a(abc)*d) -> (|[bcd]|a(abc)*[bc]|a(abc)*d[abcd]+|
> aab[abd][abcd]*|aa[acd][abcd]*)
>
> On informal inspection this seems right (do you see any mistakes).
> I.e. an empty string mismatches, since the regex derives only nonempty
> strings, so we include it. The the one-letter strings b, c and d do
> not match, because the regex has a null derivative with respect to
> these. An a followed by by zero or more repetitions of "abc",
> followed by something other than an a or d (i.e. [bc]) does not match
> and so is a branch of of the complement. And so forth.
Your algorithm which works directly on "regular expressions" is
actually minimicing the state traversal that one does in the finite
automaton versions of the algorithm. Not, that that is a bad thing.
In fact, it is one way of establishing the correctness of the
algorithm, by showing how it does exactly the same thing just using a
different notation.
In fact, it is possible to represent the states of a finite automaton,
as a series of sets of "dotted items" (each state being one set of
dotted items). Each dotted item, having a dot (cursor) in the regular
expression showing where in the regular expression that state
represents.
I believe you will find a correspondence between these sets of dotted
items and derivatives.
So, in that sense, there is an affirmative answer to the original
question. There is an algorithm that works directly with regular
expressions that generates the difference between two regular
expressions. However, the mechanics of the algorithm recapitulate
transforming the regular expressions into finite automatons and
working with the states to find taversals to all the states that are
final in the first automaton (subtrahend?) but not final in the second
(minuend?).
It is also possible to do the same thing with algebraic manipulations,
if you allow yourself to write an expression the with regular
expressions augmented by the set theory operators (union,
intersection, complement) and then algebraicly transform it into an
expression without those operators. My intuition tells me that the
algebra will effectively perform the same operations as the traversal
algorithm. However, I leave such as an exercise to the motivated
reader.
Hope this helps,
-Chris
******************************************************************************
Chris Clark email: christoph...@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------