one that can converted into a finite state machine
Another way of saying is that a 'regular expression' is a pattern that
can be used for matching with no backtracking needed. Such pattern
matches are fast.
My guess is that early implementations of regular expressions were in
fact implemented as finite state machines, but people decided they were
not powerful enough. So they kept on adding extensions until the
original goal of having regular expressions, i.e., to match with no
backtracking, was lost.
The ? suffix in perl5, which causes quantifiers to become nongreedy,
help bring perl paterns close to real regular expressions.
/a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
Perhaps it's now too late, and we need to find a new term for what
we used to call regular expressions.

Rahul Dhesi <dh...@spams.r.us.com>
a2i communications, a quality ISP with sophisticated antijunkmail features
** message body scan immune to fake headers *** see http://www.rahul.net/
>>> "please ignore Dhesi"  Mark Crispin <m...@CAC.Washington.EDU> <<<
>The ? suffix in perl5, which causes quantifiers to become nongreedy,
>help bring perl paterns close to real regular expressions.
>
> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
>
>Perhaps it's now too late, and we need to find a new term for what
>we used to call regular expressions.
My understanding is that both of the examples above are regular
expressions. Regular expressions are capable of supporting r* (Kleen
closures), r+ (positive closures), and r? (whatever this is called).
As far as I know, the only nonregular features of Perl regular
expressions is the ability to backreference a previous match using
the \1 and related patterns. This backreferencing ability allows
Perl regular expressions to recognise balanced and nested constructs,
two of the classic ``no can do'' types of patterns that can only be
expressed in a contextfree or more powerful grammer.
[ References available is sec 3.3 of the red dragon book, as well as
most every other compiler or formal language theory book ever
written. ]
So while Perl regular expressions aren't completely regular, they're
close enough that I'm not uncomfortable referring to them by that
term. I don't that a new term is called for (even though I think that
``irregular expressions'' has a nice ring to it).
 John Kodis.
I do not think the lookahead assertions are regular. I would be
surprised if they are. AFAIK the class of "all languages minus a
regular one" isn't regular, and you can describe that language with
/^(?!$regex$)/. (You can do intersection and subtraction of RE's
with lookahead too.)
Abigail
> AFAIK the class of "all languages minus a
> regular one" isn't regular, and you can describe that language with
> /^(?!$regex$)/. (You can do intersection and subtraction of RE's
> with lookahead too.)
Actually, it's more clearly described as "all strings not matched by a
given regex", which is regular, for a very simple reason that is very
nonsimple until someone tells you the trick. ;)
Simple explanation: there is a 11 correspondence between regexes and
NFAs. To negate an NFA, make every accepting state nonaccepting, and
vice versa.
It seems like intersection could be done by... er... taking the
cartesian product of the power sets of the states in the two NFAs, and
then making a state accepting iff it contains at least one accepting
state from the first NFA *and* from the second NFA. But I'm just doing
that off the top of my head and I suck at these sorts of things. For all
I know, regular languages may not be closed under intersection.
So now can anyone give me the simple explanation for why language
ambiguity is undecidable? I should know this, but I don't and I'm too
lazy to look it up. (And please don't send it to the perl newsgroup!)
> ++ Previously, Rahul Dhesi wrote:
> ++ > /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
My comments on the above comments:
 right
 depends on the implementation
 maybe
 wrong. This is as regular as it gets.
> ++ > /a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
 right
 often would need backtracking, eg for the string axbbc the matcher will
first pick x for the first .*?, but then will have to reconsider and pick
xb.
 sometimes
 yes it is regular. The greedy/nongreedy distinction doesn't affect
regularity, but only what parts of the string are mathched to what part of
the regexp.
> ++ Regular expressions are capable of supporting r* (Kleen
> ++ closures), r+ (positive closures), and r? (whatever this is called).
A shortcut for rempty. The nongreedy modifier ? is not part of the
formallanguage definition of regexp.
> ++ the \1 and related patterns. This backreferencing ability allows
> ++ Perl regular expressions to recognise balanced and nested constructs,
No, it doesn't. But it's true that \1 is nonregular.
> AFAIK the class of "all languages minus a regular one" isn't regular,
What exactly do you mean by the above construction? Because the complement of
a regular language R (the set of all strings not in R) is certainly regular.
Just take the corresponding automaton and "invert" it: make all accepting
states nonaccepting and vice versa.
> (You can do intersection and subtraction of RE's with lookahead too.)
The intersection of two regular languages is regular. Take the two automata A1
and A2 and make their product A3. The states of A3 are pairs <q1,q2> of states
of A1 and A2 and the transitions of A3 are <q11,q21> a> <q12,q22> iff
q11 a> q12 in A1 and q21 a> q22 in A2. The accepting states of A3 are
<q1,q2> where q1 is accepting in A1 and q2 is accepting in A2.
By the intersection and complement results, it also follows that the
difference of two regular languages is also regular. The same holds for union,
though a more direct proof is available: take the disjoint union of A1 and A2
(except merge their initial states); this automaton is nondeterministic, but
it still recognizes a regular language.
But, this is not a theory group :)
> My apologies for putting this on a perl newsgroup, but it sorta kinda
> relates.
You shouldn't apologize for educating people :) Although I've read somewhere
that lecturers at 16th century universities usually started with "Esteemed
colleagues, please kindly allow me to bore you with the material I have so
carefully prepared".
> It seems like intersection could be done by... er... taking the
> cartesian product of the power sets of the states in the two NFAs
If you start with DFAs, you can simply take the cartesian product of the sets
of states, not the power sets.
> For all I know, regular languages may not be closed under intersection.
They are.
[mail and post]
Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
There's no fundamental difference in the above two, other than in which of
the plausable matches will be selected. Both do backtrack, (and, for that
matter, will do so exactly an equal number of times for a nonmatch). There
are times when the first will backtrack much more than the second, and
times when the second will backtrack much more than the first.
All other things being equal, the first (the one you say is slow) will be
faster than the second, but this is due to some of Perl's internal
optimizations. In any case, knowledge of the likely data (and the
requirements of the task, of course) often make one better than the other.
> Perhaps it's now too late, and we need to find a new term for what
> we used to call regular expressions.
Perhaps it's best to differentiate between the calculus phrase ``regular
expression'', coined by Dr. Stephen Kleene in the 40s, and the use of the
same term in modern computer science. There are many situations in language
where the meanings of words have changed. Some examples:
When I was growing up, ``How are you doing?'' was not a statement.
A Japanese futon has no relationship to what is called a "futon" in
American English.
Tennis shoes are rarely used for tennis.
The little eraserhead pointing devices on laptops are called mice.
Ah, well, I could go on forever. But basically, the term ``regular
expression'' has no implied meaning to most people that first come across
it in Perl or whatnot, so little confusion arises. It would have been nice
had some other word been used from the start (perhaps one giving more of a
hint as to what the thing is all about), but for the most part we're stuck
with history. I suppose you could start a crusade to use a different
phrase. How about 'whachamacallit'? :)
Jeffrey

Jeffrey Friedl <jfr...@ora.com>
See my Jap<>Eng dictionary at http://www.wg.omron.co.jp/cgibin/je
O'Reilly's Regular Expression book: http://enterprise.ic.gc.ca/~jfriedl/regex/
In <5gcv7d$j...@samba.rahul.net> I wrote:
> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
The above examples would have been better written as anchored:
/^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/^a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
Now the second one needs no backtracking:
find an 'a' or fail;
skip past all non'b' characters;
find the 'b' or fail;
skip past all non'c' characters;
find the 'c' or fail;
succeed
A nonanchored version simply repeats the search at every point in the
search string until it succeeds or, if it runs out of string, fails. So
the nonanchored nongreedy version will also need no backtracking.
I am assuming that onecharacter lookahead is not considered to be
backtracking. If onecharacter lookahead is backtracking, then here are
better examples:
/^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/^a[^b]*b[^c]*c/ # nongreedy, no backtracking, fast, regular expression
The important point in all this is that some patterns are regular
expressions and others are not, and calling all of them regular
expressions is not a good idea. The term 'search pattern' or just
'pattern' is nicely generic and does not cause any confusion. Those
being introduced to search patterns for the first time will find no
useful meaning in the term 'regular expression' anyway, since it's not
an English term but a mathematical term. If we must use a mathematical
term, we should use it correctly.
> No, it doesn't. But it's true that \1 is nonregular.
 depends on the implementation
$balanced = study /(
[^\(\)]+

\(
(?>$balanced)
\)
)*/x ;
 maybe
Ilya
P.S. study and (?>...) are not implemented yet, but close. As is
an optimization of (a+b)* for most a and b.
[...]
>Ah, well, I could go on forever. But basically, the term ``regular
>expression'' has no implied meaning to most people that first come across
>it in Perl or whatnot, so little confusion arises. It would have been nice
>had some other word been used from the start (perhaps one giving more of a
>hint as to what the thing is all about), but for the most part we're stuck
>with history. I suppose you could start a crusade to use a different
>phrase. How about 'whachamacallit'? :)
And I highly recommend the second edition of Jeffrey's Book called,
"Mastering Whachamacallits"

Clay Irving See the happy moron,
cl...@panix.com He doesn't give a damn,
http://www.panix.com/~clay I wish I were a moron,
My God! Perhaps I am!
> The above examples would have been better written as anchored:
>
> /^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /^a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
Still wrong. Both match exaclt the same strings. Greedy matching says "return
(one of) the way(s) to match which maximize the (leftmost) greedy *".
Nongreedy match says "return (one of) the way(s) to match which minimizes the
(leftmost) nongreedy match". Neither changes the set of matching strings, nor
the set of ways to match.
> Now the second one needs no backtracking:
> find an 'a' or fail;
> skip past all non'b' characters;
Wrong. Consider /^a.*?b[^b]*?c/ and the string 'abbc'. It will initially pick
the first b in the string as the b in the regexp, assuming that the .*? is
empty. Then it will have to reconsider and assign one b to the .*?
> A nonanchored version simply repeats the search at every point in the
> search string
This would be a very naive and slow way to implement it.
> /^a[^b]*?b[^c]*?c/ # nongreedy, no backtracking, fast, regular expression
This one won't need backtracking indeed.
> some patterns are regular expressions and others are not, and calling all of
> them regular expressions is not a good idea.
You have a very wrong idea of what is a regular expression, I'm afraid.
Regular languages are ones that can be produced using regular operations:
empty, letter, concatenation, alternative, and star. The greedy/nongreedy
distinction doesn't come into play here at all. Some of the perl extensions
still keep it regular (eg foo{m,n} is just foo{m}foo{m+1}..foo{n} where
foo{x} stands for 'foo foo ..<x times>.. foo') and some don't (eg (.*)\1 is
not even a contextfree language).
How a matcher is implemented (backtracing etc) has no bearing on the langauge
it recognizes (unless it's implemented incorrectly :). Consider this: DFA
(deterministic automata) and NFA (nondeterministic ones) have the same
expressive power, namely that of regular languages. The standard construction
that converts a NFA into an equivalent DFA takes as states of the DFA to be
the powerset of the states of the NFA. Thus if the NFA has n states, the DFA
has 2^n. Which doesn't mean that any selfrespecting programmer will implement
it that way.
> If we must use a mathematical term, we should use it correctly.
Indeed ;)
In comp.lang.perl.misc,
c.c....@33.usenet.us.com (Rahul Dhesi) writes:
:The important point in all this is that some patterns are regular
:expressions and others are not, and calling all of them regular
:expressions is not a good idea. The term 'search pattern' or just
:'pattern' is nicely generic and does not cause any confusion. Those
:being introduced to search patterns for the first time will find no
:useful meaning in the term 'regular expression' anyway, since it's not
:an English term but a mathematical term. If we must use a mathematical
:term, we should use it correctly.
I'll make a deal: you get everyone to first rename their grep programs to
gspp, and I'll change the perl docs. Remember that grep allows \1 stuff
in its regular expressions.
Don't we have anything more important to argue about?
tom

Tom Christiansen tch...@jhereg.perl.com
"Yes, you can think of almost everything as a function, but this may upset
your wife." Larry Wall
>Consider this: DFA
>(deterministic automata) and NFA (nondeterministic ones) have the same
>expressive power, namely that of regular languages.
Therefore to decide whether a perl pattern is a regular expression we
need only ask if we can create either a DFA or an NFA to implement the
search.
In general we cannot implement the search for most perl patterns
with DFAs or NFAs because most of them require backtracking.
Because to implement backtracking you must augment the DFA or NFA with
memory, and once you add memory, it's no longer a DFA or NFA. A DFA or
NFA has no memory, other than knowing what state it is in. It does not
remember how it got to that state, so it cannot backtrack.
> In general we cannot implement the search for most perl patterns
> with DFAs or NFAs because most of them require backtracking.
Backtracking is an implementation feature of the perl pattern matcher (ok, of
any reasonable pattern matcher) that has nothingnto do with regularity.
> Because to implement backtracking you must augment the DFA or NFA with
> memory, and once you add memory
I don't know the precise sense in which you use "backtracking" above, but in
eg Prolog it is used to implement nondeterminism (to return all solutions to a
problem). You could find all solutions to a problem with a device that's
capable of cloning itself at every choice point. Which a NFA precisely is.
> It does not remember how it got to that state, so it cannot backtrack.
Nor does it need to. Backtracking is a hack used to model nondeterminism (in a
sense, parallelism) on a sequential computer. You don't need backtracking if
you have NFA in your disposal, and I think you agreed that NFAs are still
regular. (BTW they don't have a higher expressive power than DFA for a "wrong
reason": the simulation of a NFA by a DFA explodes the state space
exponentially, but the exponentiation of a finite number is still finite. This
becomes a concern when you care about the "complexity" of an algorithm, eg
NPcomplete problems [the epitome of "exponentially hard" problems] can be
solved in polytime on a NTM; in fact that is the definition of NP.)
> >>> "please ignore Dhesi"  Mark Crispin <m...@CAC.Washington.EDU> <<<
In this case I fully agree with Mark.
>You don't need backtracking if
>you have NFA in your disposal, and I think you agreed that NFAs are still
>regular.
Are you claiming that every perl search can be implemented as an NFA?
If not, then I fail to see the relevance of the above argument.

Rahul Dhesi <dh...@spams.r.us.com>
a2i communications, a quality ISP with sophisticated antijunkmail features
** message body scan immune to fake headers *** see http://www.rahul.net/
Clay Irving <cl...@panix.com> wrote:
> >with history. I suppose you could start a crusade to use a different
> >phrase. How about 'whachamacallit'? :)
>
> And I highly recommend the second edition of Jeffrey's Book called,
> "Mastering Whachamacallits"
Mmmm, somehow this reminds me of the last line of the first paragraph
on page 71 (of the first edition :)
Jeffrey

Jeffrey Friedl <jfr...@ora.com>
O'Reilly & Associates' _Mastering Regular Expressions_
http://enterprise.ic.gc.ca/~jfriedl/regex/
[mail and post and cc]
Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
> I got email from Jeffrey Friedl and I'm not sure if it was private email
> or a copy of a posting. In any case, based on his comment's here's a
> followup posting.
I wish you would have posted my original note, since your conclusions were
pretty unrelated to what I wrote. ["private email" refers to the method of
sending, not the contents  like with all other means of human
communication, I rely only on your sense of right and wrong to not divulge
that which I intend to be private between you and me  obviously, that
doesn't apply here, so you should have felt free to post it if you weren't
sure if it had been. This is a mail and post, though, so need here]
> /^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /^a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
>
> Now the second one needs no backtracking:
>
> find an 'a' or fail;
The "or fail" is, by definition, backtracking. As I told you in my earlier
message, both will backtrack. There are situations where the 2nd will
backtrack more than the 1st. If there is no match, they'll both backtrack
exactly the same number of times (since they both match the same possible
sets of strings, but in different orders, they both must test all
permutations before a nonmatch is reported, and the order doesn't
influence the number).
> I am assuming that onecharacter lookahead is not considered to be
> backtracking.
That's a silly assumption, but also irrelevant to the problem at hand. The
``lookahead'' for the first .*?, for example, is /b.*?c/, not simply one
character.
> If onecharacter lookahead is backtracking, then here are better
> examples:
>
> /^a[^b]*b[^c]*c/ # nongreedy, no backtracking, fast, regular expression
This indeed will match faster, in general, but still requires backtracking.
Any match will, by definition, have entailed 2 backtracks.
> The important point in all this is that some patterns are regular
> expressions and others are not, and calling all of them regular
> expressions is not a good idea.
I don't know what this comment has to do with the rest of your post (is it
supposed to be related?), but to a preponderance of people that *use* these
things, ``regular expression'' has just as much implied meaning as ``bob'',
so it's not a hindrance. Sure, it's not what the algebraic term ``regular
expression'' means, but so what? I guess if it hits your hot button you'd
care. I don't happen to think it's a big deal, but don't deny you your right
to think it is :)
> The term 'search pattern' or just
> 'pattern' is nicely generic and does not cause any confusion.
Indeed.
> If we must use a mathematical term, we should use it correctly.
You mean like ``addition'' and the like? The computation meaning of
these terms is SUBSTANTIALLY different from the real meaning  in
real life, mathematical ops don't lose precision, overflow, etc.
Seems to be in the same league to me.
Vladimir Alexiev <vlad...@cs.ualberta.ca> followed up:
> > /^a[^b]*?b[^c]*?c/ # nongreedy, no backtracking, fast, regular expression
> This one won't need backtracking indeed.
Sure it does... when the [^b] fails, it can backtrack to the ``gee, because
of the * I didn't really need to attempt [^b], so let's skip it'' state.
> How a matcher is implemented (backtracing etc) has no bearing on the
> langauge it recognizes (unless it's implemented incorrectly :).
> Consider this: DFA (deterministic automata) and NFA (nondeterministic
> ones) have the same expressive power
In theory (and some practice), an NFA and DFA can be made to match the
exact same language, but in common practice, an ``NFA'' is usually endowed
with a much more expressive language.
To a DFA (awk, lex, etc.), /aand/ and /anda/ and /a(nd?)/ are *exactly*
the same. (Internally, they boil down to the same thing.) Yet they're
quite different with Perl and the majority of regex engines out there.
(Quiz: the 3rd will match the same as which of the first two?)
> Regular languages are ones that can be produced using regular operations:
> empty, letter, concatenation, alternative, and star. The greedy/nongreedy
> distinction doesn't come into play here at all.
Oh, yes it does! (unless you're talking only calculus theory and not
considering common practice). If they're the same, then so are /.*/ and
/.*?/, and if the target string has any [nonnewline] characters they're
guaranteed to be different.
Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
> Because to implement backtracking you must augment the DFA or NFA with
> memory, and once you add memory, it's no longer a DFA or NFA.
``memory'' is simply a method of implementation. If you implemented a true
NFA and made the right decision at each step, you'd need no memory. Pretty
similar to a maze. A DFA is an NFA with the right decision guaranteed in
practice, much like a maze with all wrongpaths closed off.
> >You don't need backtracking if you have NFA in your disposal, and I think
> >you agreed that NFAs are still regular.
>
> Are you claiming that every perl search can be implemented as an NFA?
> If not, then I fail to see the relevance of the above argument.
I think there are a lot of problems with semantics here (which is, I believe,
the whole reason you brought up this thread :). In calculus theory,
NFAs and DFAs provide different ways to view the same thing. In computing
*practice*, the terms ``NFA'' and ``DFA'' are not necessiarly strongly
related to their mathematical background. Perhaps ``NFA style'' etc., would
be better.
You're certainly right that many Perl regex constructs can not be
represented with a (calculoulslypure) NFA (or DFA, or anything else
regular, since Perl regular expressions aren't regular  as you well know :).
But then, that's OK with me  Perl regular expressions are not calculus,
they're a programming language within the Perl programming language.
Tom C jumped in with:
> Don't we have anything more important to argue about?
Like how to use the term ``email'' in English? /:
To see why /^a.*?b.*?c/ does not require backtracking, implement it as a
finite state machine, as described below. Begin in state 'start'. The
state machine scans its input and makes a state transition on each input
character. Once we have seen an input character and made a state
transition we never need to examine that character again. Hence no
backtracking. EOF represents the end of the input stream:
start: if input is 'b' go to state 'sawb' else if input is EOF
go to state 'fail' else go to (aka remain in) state 'start'.
sawb: if input is 'c' go to state 'done' else if input is EOF
go to state 'fail' else go to state 'sawb'.
done: declare a match.
fail: declare failure.
Note that 'does not require backtracking' means 'at least one
implementation does not require backtracking'.
> start: if input is 'b' go to state 'sawb' else if input is EOF
> go to state 'fail' else go to (aka remain in) state 'start'.
> sawb: if input is 'c' go to state 'done' else if input is EOF
> go to state 'fail' else go to state 'sawb'.
> done: declare a match.
> fail: declare failure.
Unless I'm missing something, you need another state; perhaps rename "start"
to "sawa" and add
start: if input is 'a' go to state 'sawa' else if input is EOF
go to state 'fail' else go to (aka remain in) state 'start'.
Bennett
> >You don't need backtracking if you have NFA in your disposal, and I think
> >you agreed that NFAs are still regular.
> Are you claiming that every perl search can be implemented as an NFA?
No. But YOU claimed that every language that requires backtracking (in a
certain IMPLEMENTATION you seem to have in mind) is not regular, which is not
true. And you gave several examples, which all fail to be nonregular.
> If not, then I fail to see the relevance of the above argument.
Perl regexps do not specify regular languages. But not for the reasons you
believe them to.
>
> > /^a.*?b.*?c/ # nongreedy, no backtracking, fast, regular expression
>
> To see why /^a.*?b.*?c/ does not require backtracking, implement it as a
> finite state machine, as described below.
And exactly the same FA works for the greedy ^a.*b.*c
>Unless I'm missing something, you need another state; perhaps rename "start"
>to "sawa" and add
> start: if input is 'a' go to state 'sawa' else if input is EOF
> go to state 'fail' else go to (aka remain in) state 'start'.
Yes, you are right. The whole thing, revised, is below.
===
Finite state machine to implement /^a.*?b.*?c/ .
Begin in state 'start'. The state machine scans its input and makes a
state transition on each input character. Once we have seen an input
character and made a state transition we never need to examine that
character again. Hence no backtracking. EOF represents the end of the
input stream:
start: if input is 'a' go to state 'sawa' else go to state 'fail'.
sawa: if input is 'b' go to state 'sawb' else if input is EOF
go to state 'fail' else go to (aka remain in) state 'sawa'.
sawb: if input is 'c' go to state 'done' else if input is EOF
go to state 'fail' else go to state 'sawb'.
done: declare a match.
fail: declare failure.
CLAIM: /^a.*b.*c/ (greedy) and /^a.*?b.*?c/ (nongreedy) both match the
same strings.
RESPONSE: Any string that in its entirety is matched by /^a.*b.*c/ is
also matched by /^a.*?b.*c/ . But what if the input stream is not
entirely matched by either? Then, in normal perl code operation, we may
end up matching different substrings of the input stream by the two
patterns. Thus it's incorrect to say that both match the same strings
unless you carefully define just what you mean by 'match' and 'the same
strings'.
COMMENT: Perl and similar pattern matching has one of two goals:
(a) either deciding whether an entire input string is matched by a given
pattern, or (b) searching the input string for some substring that is
matched by a given pattern. Language parsers, on the other hand, care
not much about whether or not we can find a substring that can be parsed
using the rules of the language  we generally want to parse the entire
input stream. If we find a portion that does not fit our grammar we
declare a parse error.
Therefore blindly extending the concepts of language grammars to pattern
matching will lead to much confusion.
CLAIM: /^a.*?b.*?c/ requires backtracking.
RESPONSE: See finite state machine in another posting.
CLAIM: In /^a.*?b.*?c/ the ``lookahead'' for the first .*? is /b.*?c/,
not simply one character.
RESPONSE: See finite state machine in another posting.
CLAIM: /^a.*?b.*?c/ (nongreedy) and /^a.*b.*c/ (greedy) can be
implemented with exactly the same finite state machine.
RESPONSE: The greedy /^a.*b.*c/ cannot be implemented with a finite
state machine. The first time that we see a 'b' we cannot know if this
is the 'b' we want. Want want the longest possible match for the .*
preceding the 'b'. So we must keep scanning for more occurrences of 'b'
But if we do and find none, then we have lost all the input symbols so
far and cannot make any more decisions based on them. A finite state
machine has no memory of the symbols it has already seen.
CLAIM: /^a.*?b.*?c/ requires backtracking but /^a[^b]*?b[^c]*?c/ does not.
RESPONSE: There is no difference  the simplest finite state machines
that implement both are identical.
CLAIM: Anything that is generated by a regular language is a regular
expression.
RESPONSE: Only if you remain within the spirit of regular languages and
ask only this question: Whether or not the entire input stream is a
string that belongs to the regular language. As soon as you start
looking for any *substring* in the input stream that is in the regular
language, you are misusing the terminology.
CLAIMS: "How a matcher is implemented (backtracking etc) has no bearing
on the language it recognizes" and "Backtracking is an implementation
feature of the perl pattern matcher (ok, of any reasonable pattern
matcher) that has nothing to do with regularity."
RESPONSE: Certain pattern matches can be implemented without
backtracking and certain others cannot. You can always be sloppy and do
backtracking where it's not needed. But if something absolutely
requires backtracking, then you can't do it without backtracking.
NOW WE CAN SAY IT:
A regular expression is a pattern one for which we can declare a match
or no match, while scanning an input string, without having to rescan
the input string. If we must ever rescan the input string, then the
pattern we are trying to match is not a regular expression.
Almost no perl patterns are regular expressions.
Kevin
<kev...@ns.net>
> > Now the second one needs no backtracking:
> > find an 'a' or fail;
> The "or fail" is, by definition, backtracking.
Why do you say so? Eg /a/ will fail on "b", and I don't see how it could
possibly backtrack.
> > > /^a[^b]*?b[^c]*?c/
> > This one won't need backtracking indeed.
> Sure it does... when the [^b] fails, it can backtrack to the ``gee, because
> of the * I didn't really need to attempt [^b]
When would the [^b] fail? I read teh above thus: check that the first char is
a, then gobble up all ^b's, check that there is a next char (it would be b),
gobble up all ^c's, check for a c.
> in common practice, an ``NFA'' is usually endowed with a much more
> expressive language.
What do you mean by "more expressive in common practice"? If exactly the same
languages (regular ones) are recognized by both NFAs and DFAs, which is more
expressive? You can't make a distinction unless you also consider execution
time etc.
> To a DFA (awk, lex, etc.), /aand/ and /anda/ and /a(nd?)/ are *exactly*
> the same.
Ok (though you seem to have a typo in the last one).
> Yet they're quite different with Perl and the majority of regex engines out
> there.
From an implementation point of view? So what? We were talking about the math
term "regular language" and whether perl regexps implement those or more or
less. The first two regexps specify the same language (set of strings) in
either formal langauges or perl (unless the perl implementation is wrong).
> > The greedy/nongreedy > distinction doesn't come into play here at all.
> If they're the same, then so are /.*/ and /.*?/, and if the target string
> has any [nonnewline] characters they're guaranteed to be different.
They match THE SAME SET OF STRINGS, though they pick up the match differently.
Both specify a regular language.
> CLAIM: /^a.*b.*c/ (greedy) and /^a.*?b.*?c/ (nongreedy) both match the
>
> RESPONSE: what if the input stream is not entirely matched by either? Then,
> in normal perl code operation, we may end up matching different substrings
> of the input stream by the two patterns. Thus it's incorrect to say that
> both match the same strings
It's quite correct to say so: /$re1/ and /$re2/ will return 1 for the same set
of values of $_.
> COMMENT: Perl and similar pattern matching has one of two goals:
> (a) either deciding whether an entire input string is matched by a given
> pattern,
Right. But regular languages only have this goal.
> or (b) searching the input string for some substring that is
> matched by a given pattern.
If you claim that something does not specify a regular language, and if you
use the term in its mathematical sense, you have to stick with (a).
> Therefore blindly extending the concepts of language grammars to pattern
> matching will lead to much confusion.
No, blindly using the term "regular expression" in its math sense without
conforming to what Chomsky and Kleene used it for is what leads to confusion.
> CLAIM: /^a.*?b.*?c/ (nongreedy) and /^a.*b.*c/ (greedy) can be
> implemented with exactly the same finite state machine.
>
> RESPONSE:
Before I consider your response, let me reiterate my argument. Both the above
regexps require a string that starts with a and has a prefix also containing b
and c. Thus your "sawa, sawb" FA works for the second regexp too. It is true
that it would match the "wrong" c (not the last possible one), but this
doesn't matter. Because FA's *cannot specify matches*, it can only say whether
the string matched or not. At least the outputless FAs that are used in
regular language thory.
BTW there's a moot point in the definition of greedy *. How exactly should
this be maximized? Does the first .* take precedence over the second .*? Eg
for the string abbc, is
a .* b .* c a better match than a .* b .* c ?
a b b c a b b c
> The greedy /^a.*b.*c/ cannot be implemented with a finite
> state machine.
Well, how do you expect the FA to report the matches? If you want it to say "I
matched the b from the regexp" as soon as it's seen the correct b, then
obviously it can't do that, because this depends on whether there are more b's
(and c's0 in the input.
> CLAIM: Anything that is generated by a regular language is a regular
> expression.
> RESPONSE: Only if you remain within the spirit of regular languages
But certainly you should remain within this spirit. Otherwise what's the THIRD
sense (neither formal languages nor perl) in which you're using the term?
> As soon as you start looking for any *substring* in the input stream that is
> in the regular language, you are misusing the terminology.
Right ;) Formal regular languages are not concerned with the way something
matches, only with whether it matches.
> CLAIMS: "How a matcher is implemented (backtracking etc) has no bearing
> on the language it recognizes"
Now an example showing that nongreedy matching may require backtracking:
"abbc" =~ /a.*?b[^b]c/ The .*? will first match "" but then it will have to
reconsider and match the first b.
>I'll make a deal: you get everyone to first rename their grep programs
>to gspp, and I'll change the perl docs. Remember that grep allows \1
>stuff in its regular expressions.
But gspp means nothing to newcomers, and neither does grep, so both are
arbitrary terms, and grep is at least easily pronounced. I see no harm
in short arbitrary names.
'Search pattern' has the correct intuitive meaning for new perl users.
'Regular expression' asks the nonmathematical reader to believe that we
are talking about a special type of expression, one that is "regular" in
some sense, which is not really true. And 'regular expression' invites
the reader with a mathematical background to be misled into thinking
that such an expression might be implemented with a finite state
machine, which is generally not true either.
So I propose the following to all who read this:
In the future, whenever you have a choice, say or write 'pattern'
or 'search pattern' or 'perl pattern' as appropriate instead of
'regular expression'. It will be easier for your nontechnical
audience to understand what you mean, it will mislead fewer people,
and you even have to type fewer characters!
It looks like a winwin situation to me.
Gee, you mean like the variations in spelling and use surrounding the
word 'email'?

Jeff Stampes  Xilinx, Inc.  Boulder, CO  jeff.s...@xilinx.com
>Now an example showing that nongreedy matching may require backtracking:
>"abbc" =~ /a.*?b[^b]c/ The .*? will first match "" but then it will have to
>reconsider and match the first b.
Note that "abbc" will never match /a.*?b[^b]c/.
But anyway:
If you make this anchored, then backtracking is not required. Let's
implement a state machine for /^a.*?b[^b]c/ . An implicit fail on EOF
is assumed unless we have already declard success.
start: if 'a' goto dotstar else fail.
dotstar: if 'b' goto sawb else remain in dotstar.
sawb: if 'b' goto sawb else goto wantc.
wantc: if c declare success else if b goto sawb else goto dotstar.
> >Now an example showing that nongreedy matching may require backtracking:
> >"abbc" =~ /a.*?b[^b]c/
> Note that "abbc" will never match /a.*?b[^b]c/.
Sorry about that. What I meant was /a.*?b[^b]*?c/ (I came up with this through
minimal modification of your original example). Now THIS requires
backtracking. The idea is that a later part of the regexp makes an early empty
match on .*? to fail.
However, I need to qualify the above statement, because it *depends on the
implementation*. I should say "needs backtracking in the most obvious
implementation". In fact, below is an implementation that doesn't require
backtracking. It illustrates the methodology of transforming a NFA to a DFA.
Start with the most obvious NFA that implements /a.*?b[^b]*?c/. It can be
generated automatically from the regexp by composing singlestate FAs.
(I assume the alphabet is just a,b,c.)
input a b c
state 0 1   The table entries are the new states.
1 1 1,2 1 1,2 and 2,3 designate nondeterminism (either state).
2 2  2,3
Now we have to complete this table with the new pseudostates 1,2 and 2,3:
input a b c
state 1,2 1,2 1,2 1,2,3
2,3 2  2,3
1,2,3 1,2 1,2 1,2
Yet another pesudostate 1,2,3 appears, and we complete with it too.
Now start again from 0 and see which states are reachable. Some of the new
states subsume old states and thus make them useless.
input a b c
state 0 1  
1 1 1,2 1
1,2 1,2 1,2 1,2,3
1,2,3 1,2 1,2 1,2
The final state is 1,2,3 because it contains the original final state 3.
Or if we give some more symbolic names to our states:
input a b c
state start sawa  
sawa sawa sawb sawa
sawb sawb sawb sawc
sawc sawb sawb sawb
Where sawc is the final success state (there's transitions going "back" from
it, but that's legit).
The important point here is that we *can't talk* about the nongreediness
condition here: none of the states corresponds precisely to the .*? part in
the original regexp, so the DFA can't report when it has matched it. Indeed,
for the string "aabc" the .*? matches "a" and ends up in the sawa state, while
for "abbc" it matches "b" in the sawb state.
In fact it's impossible to build a DFA where a specific state corresponds to
.*? Consider the regexp again: /a.*?b[^b]*?c/ matches any string containing
a,b,c (in that order), and the b in the regexp is forced to match the last b
of the string (there can be no b's after it). If the state q of a DFA would
correspond to the .*?, then a b transition from it would lead us to a state r
that would recognize the last b in the string. But the DFA can't know it has
reached the last b before it has seen all of the string. Therefore there
cannot be any transitions out of r, but there must be a c transition out of
it.