term 'regular expressions' considered undesirable

14 views
Skip to first unread message

Rahul Dhesi

unread,
Mar 15, 1997, 3:00:00 AM3/15/97
to

I am concerned that the term 'regular expression' has come to mean any
pattern. In fact a 'regular expression' originally referred to a very
specific type of pattern:

one that can converted into a finite state machine

Another way of saying is that a 'regular expression' is a pattern that
can be used for matching with no backtracking needed. Such pattern
matches are fast.

My guess is that early implementations of regular expressions were in
fact implemented as finite state machines, but people decided they were
not powerful enough. So they kept on adding extensions until the
original goal of having regular expressions, i.e., to match with no
backtracking, was lost.

The ? suffix in perl5, which causes quantifiers to become non-greedy,
help bring perl paterns close to real regular expressions.

/a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

Perhaps it's now too late, and we need to find a new term for what
we used to call regular expressions.
--
Rahul Dhesi <dh...@spams.r.us.com>
a2i communications, a quality ISP with sophisticated anti-junkmail features
** message body scan immune to fake headers *** see http://www.rahul.net/
>>> "please ignore Dhesi" -- Mark Crispin <m...@CAC.Washington.EDU> <<<

John Kodis

unread,
Mar 15, 1997, 3:00:00 AM3/15/97
to

Previously, Rahul Dhesi wrote:

>The ? suffix in perl5, which causes quantifiers to become non-greedy,
>help bring perl paterns close to real regular expressions.
>
> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression
>
>Perhaps it's now too late, and we need to find a new term for what
>we used to call regular expressions.

My understanding is that both of the examples above are regular
expressions. Regular expressions are capable of supporting r* (Kleen
closures), r+ (positive closures), and r? (whatever this is called).

As far as I know, the only non-regular features of Perl regular
expressions is the ability to back-reference a previous match using
the \1 and related patterns. This back-referencing ability allows
Perl regular expressions to recognise balanced and nested constructs,
two of the classic ``no can do'' types of patterns that can only be
expressed in a context-free or more powerful grammer.

[ References available is sec 3.3 of the red dragon book, as well as
most every other compiler or formal language theory book ever
written. ]

So while Perl regular expressions aren't completely regular, they're
close enough that I'm not uncomfortable referring to them by that
term. I don't that a new term is called for (even though I think that
``irregular expressions'' has a nice ring to it).

-- John Kodis.

Abigail

unread,
Mar 17, 1997, 3:00:00 AM3/17/97
to

On 15 Mar 1997 15:21:22 GMT, John Kodis wrote in comp.lang.perl.misc:
++ Previously, Rahul Dhesi wrote:
++
++ >The ? suffix in perl5, which causes quantifiers to become non-greedy,
++ >help bring perl paterns close to real regular expressions.
++ >
++ > /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
++ > /a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression
++ >
++ >Perhaps it's now too late, and we need to find a new term for what
++ >we used to call regular expressions.
++
++ My understanding is that both of the examples above are regular
++ expressions. Regular expressions are capable of supporting r* (Kleen
++ closures), r+ (positive closures), and r? (whatever this is called).
++
++ As far as I know, the only non-regular features of Perl regular
++ expressions is the ability to back-reference a previous match using
++ the \1 and related patterns. This back-referencing ability allows
++ Perl regular expressions to recognise balanced and nested constructs,
++ two of the classic ``no can do'' types of patterns that can only be
++ expressed in a context-free or more powerful grammer.

I do not think the look-a-head assertions are regular. I would be
surprised if they are. AFAIK the class of "all languages minus a
regular one" isn't regular, and you can describe that language with
/^(?!$regex$)/. (You can do intersection and subtraction of RE's
with look-a-head too.)

Abigail


Steve Fink

unread,
Mar 18, 1997, 3:00:00 AM3/18/97
to abi...@ny.fnx.com

My apologies for putting this on a perl newsgroup, but it sorta kinda
relates.

> AFAIK the class of "all languages minus a
> regular one" isn't regular, and you can describe that language with
> /^(?!$regex$)/. (You can do intersection and subtraction of RE's
> with look-a-head too.)

Actually, it's more clearly described as "all strings not matched by a
given regex", which is regular, for a very simple reason that is very
non-simple until someone tells you the trick. ;)

Simple explanation: there is a 1-1 correspondence between regexes and
NFAs. To negate an NFA, make every accepting state non-accepting, and
vice versa.

It seems like intersection could be done by... er... taking the
cartesian product of the power sets of the states in the two NFAs, and
then making a state accepting iff it contains at least one accepting
state from the first NFA *and* from the second NFA. But I'm just doing
that off the top of my head and I suck at these sorts of things. For all
I know, regular languages may not be closed under intersection.

So now can anyone give me the simple explanation for why language
ambiguity is undecidable? I should know this, but I don't and I'm too
lazy to look it up. (And please don't send it to the perl newsgroup!)

Vladimir Alexiev

unread,
Mar 20, 1997, 3:00:00 AM3/20/97
to

In article <E77HB...@nonexistent.com> abi...@ny.fnx.com (Abigail) writes:

> ++ Previously, Rahul Dhesi wrote:
> ++ > /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression

My comments on the above comments:
- right
- depends on the implementation
- maybe
- wrong. This is as regular as it gets.

> ++ > /a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

- right
- often would need backtracking, eg for the string axbbc the matcher will
first pick x for the first .*?, but then will have to reconsider and pick
xb.
- sometimes
- yes it is regular. The greedy/nongreedy distinction doesn't affect
regularity, but only what parts of the string are mathched to what part of
the regexp.

> ++ Regular expressions are capable of supporting r* (Kleen


> ++ closures), r+ (positive closures), and r? (whatever this is called).

A shortcut for r|empty. The non-greedy modifier ? is not part of the
formal-language definition of regexp.

> ++ the \1 and related patterns. This back-referencing ability allows
> ++ Perl regular expressions to recognise balanced and nested constructs,

No, it doesn't. But it's true that \1 is non-regular.

> AFAIK the class of "all languages minus a regular one" isn't regular,

What exactly do you mean by the above construction? Because the complement of
a regular language R (the set of all strings not in R) is certainly regular.
Just take the corresponding automaton and "invert" it: make all accepting
states non-accepting and vice versa.

> (You can do intersection and subtraction of RE's with look-a-head too.)

The intersection of two regular languages is regular. Take the two automata A1
and A2 and make their product A3. The states of A3 are pairs <q1,q2> of states
of A1 and A2 and the transitions of A3 are <q11,q21> -a-> <q12,q22> iff
q11 -a-> q12 in A1 and q21 -a-> q22 in A2. The accepting states of A3 are
<q1,q2> where q1 is accepting in A1 and q2 is accepting in A2.

By the intersection and complement results, it also follows that the
difference of two regular languages is also regular. The same holds for union,
though a more direct proof is available: take the disjoint union of A1 and A2
(except merge their initial states); this automaton is non-deterministic, but
it still recognizes a regular language.

But, this is not a theory group :-)

Vladimir Alexiev

unread,
Mar 20, 1997, 3:00:00 AM3/20/97
to

In article <332F2B...@cs.berkeley.edu> Steve Fink <sf...@cs.berkeley.edu> writes:

> My apologies for putting this on a perl newsgroup, but it sorta kinda
> relates.

You shouldn't apologize for educating people :-) Although I've read somewhere
that lecturers at 16th century universities usually started with "Esteemed
colleagues, please kindly allow me to bore you with the material I have so
carefully prepared".

> It seems like intersection could be done by... er... taking the
> cartesian product of the power sets of the states in the two NFAs

If you start with DFAs, you can simply take the cartesian product of the sets
of states, not the power sets.

> For all I know, regular languages may not be closed under intersection.

They are.

Jeffrey

unread,
Mar 20, 1997, 3:00:00 AM3/20/97
to Rahul Dhesi

[mail and post]

Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
|> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression

|> /a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

There's no fundamental difference in the above two, other than in which of
the plausable matches will be selected. Both do backtrack, (and, for that
matter, will do so exactly an equal number of times for a non-match). There
are times when the first will backtrack much more than the second, and
times when the second will backtrack much more than the first.

All other things being equal, the first (the one you say is slow) will be
faster than the second, but this is due to some of Perl's internal
optimizations. In any case, knowledge of the likely data (and the
requirements of the task, of course) often make one better than the other.

|> Perhaps it's now too late, and we need to find a new term for what

|> we used to call regular expressions.

Perhaps it's best to differentiate between the calculus phrase ``regular
expression'', coined by Dr. Stephen Kleene in the 40s, and the use of the
same term in modern computer science. There are many situations in language
where the meanings of words have changed. Some examples:

When I was growing up, ``How are you doing?'' was not a statement.

A Japanese futon has no relationship to what is called a "futon" in
American English.

Tennis shoes are rarely used for tennis.

The little eraser-head pointing devices on laptops are called mice.

Ah, well, I could go on forever. But basically, the term ``regular
expression'' has no implied meaning to most people that first come across
it in Perl or whatnot, so little confusion arises. It would have been nice
had some other word been used from the start (perhaps one giving more of a
hint as to what the thing is all about), but for the most part we're stuck
with history. I suppose you could start a crusade to use a different
phrase. How about 'whachamacallit'? :-)

Jeffrey
----------------------------------------------------------------------------
Jeffrey Friedl <jfr...@ora.com>
See my Jap<->Eng dictionary at http://www.wg.omron.co.jp/cgi-bin/j-e
O'Reilly's Regular Expression book: http://enterprise.ic.gc.ca/~jfriedl/regex/


Rahul Dhesi

unread,
Mar 21, 1997, 3:00:00 AM3/21/97
to

I got email from Jeffrey Friedl and I'm not sure if it was private email
or a copy of a posting. In any case, based on his comment's here's a
followup posting.

In <5gcv7d$j...@samba.rahul.net> I wrote:

> /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

The above examples would have been better written as anchored:

/^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/^a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

Now the second one needs no backtracking:

find an 'a' or fail;
skip past all non-'b' characters;
find the 'b' or fail;
skip past all non-'c' characters;
find the 'c' or fail;
succeed

A non-anchored version simply repeats the search at every point in the
search string until it succeeds or, if it runs out of string, fails. So
the non-anchored non-greedy version will also need no backtracking.

I am assuming that one-character lookahead is not considered to be
backtracking. If one-character lookahead is backtracking, then here are
better examples:

/^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
/^a[^b]*b[^c]*c/ # non-greedy, no backtracking, fast, regular expression

The important point in all this is that some patterns are regular
expressions and others are not, and calling all of them regular
expressions is not a good idea. The term 'search pattern' or just
'pattern' is nicely generic and does not cause any confusion. Those
being introduced to search patterns for the first time will find no
useful meaning in the term 'regular expression' anyway, since it's not
an English term but a mathematical term. If we must use a mathematical
term, we should use it correctly.

Ilya Zakharevich

unread,
Mar 21, 1997, 3:00:00 AM3/21/97
to

[A complimentary Cc of this posting was sent to Vladimir Alexiev
<vlad...@cs.ualberta.ca>],
who wrote in article <omk9n2f...@tees.cs.ualberta.ca>:

> In article <E77HB...@nonexistent.com> abi...@ny.fnx.com (Abigail) writes:
>
> > ++ Previously, Rahul Dhesi wrote:
> > ++ > /a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> My comments on the above comments:
> - right
> - depends on the implementation
> - maybe
> - wrong. This is as regular as it gets.
> > ++ the \1 and related patterns. This back-referencing ability allows
> > ++ Perl regular expressions to recognise balanced and nested constructs,

> No, it doesn't. But it's true that \1 is non-regular.

- depends on the implementation
$balanced = study /(
[^\(\)]+
|
\(
(?>$balanced)
\)
)*/x ;
- maybe

Ilya

P.S. study and (?>...) are not implemented yet, but close. As is
an optimization of (a+|b)* for most a and b.

Clay Irving

unread,
Mar 21, 1997, 3:00:00 AM3/21/97
to

In <JFRIEDL.97...@tubby.nff.ncl.omron.co.jp> jfr...@tubby.nff.ncl.omron.co.jp (Jeffrey) writes:

[...]

>Ah, well, I could go on forever. But basically, the term ``regular
>expression'' has no implied meaning to most people that first come across
>it in Perl or whatnot, so little confusion arises. It would have been nice
>had some other word been used from the start (perhaps one giving more of a
>hint as to what the thing is all about), but for the most part we're stuck
>with history. I suppose you could start a crusade to use a different
>phrase. How about 'whachamacallit'? :-)

And I highly recommend the second edition of Jeffrey's Book called,
"Mastering Whachamacallits"


--
Clay Irving See the happy moron,
cl...@panix.com He doesn't give a damn,
http://www.panix.com/~clay I wish I were a moron,
My God! Perhaps I am!

Vladimir Alexiev

unread,
Mar 21, 1997, 3:00:00 AM3/21/97
to

In article <5gt41k$c...@samba.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

> The above examples would have been better written as anchored:
>
> /^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
> /^a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

Still wrong. Both match exaclt the same strings. Greedy matching says "return
(one of) the way(s) to match which maximize the (leftmost) greedy *".
Non-greedy match says "return (one of) the way(s) to match which minimizes the
(leftmost) non-greedy match". Neither changes the set of matching strings, nor
the set of ways to match.

> Now the second one needs no backtracking:
> find an 'a' or fail;
> skip past all non-'b' characters;

Wrong. Consider /^a.*?b[^b]*?c/ and the string 'abbc'. It will initially pick
the first b in the string as the b in the regexp, assuming that the .*? is
empty. Then it will have to reconsider and assign one b to the .*?

> A non-anchored version simply repeats the search at every point in the
> search string

This would be a very naive and slow way to implement it.

> /^a[^b]*?b[^c]*?c/ # non-greedy, no backtracking, fast, regular expression
This one won't need backtracking indeed.

> some patterns are regular expressions and others are not, and calling all of
> them regular expressions is not a good idea.

You have a very wrong idea of what is a regular expression, I'm afraid.
Regular languages are ones that can be produced using regular operations:
empty, letter, concatenation, alternative, and star. The greedy/non-greedy
distinction doesn't come into play here at all. Some of the perl extensions
still keep it regular (eg foo{m,n} is just foo{m}|foo{m+1}|..|foo{n} where
foo{x} stands for 'foo foo ..<x times>.. foo') and some don't (eg (.*)\1 is
not even a context-free language).

How a matcher is implemented (backtracing etc) has no bearing on the langauge
it recognizes (unless it's implemented incorrectly :-). Consider this: DFA
(deterministic automata) and NFA (non-deterministic ones) have the same
expressive power, namely that of regular languages. The standard construction
that converts a NFA into an equivalent DFA takes as states of the DFA to be
the powerset of the states of the NFA. Thus if the NFA has n states, the DFA
has 2^n. Which doesn't mean that any self-respecting programmer will implement
it that way.

> If we must use a mathematical term, we should use it correctly.

Indeed ;-)

Tom Christiansen

unread,
Mar 21, 1997, 3:00:00 AM3/21/97
to

[courtesy cc of this posting sent to cited author via email]

In comp.lang.perl.misc,
c.c....@33.usenet.us.com (Rahul Dhesi) writes:
:The important point in all this is that some patterns are regular


:expressions and others are not, and calling all of them regular

:expressions is not a good idea. The term 'search pattern' or just


:'pattern' is nicely generic and does not cause any confusion. Those
:being introduced to search patterns for the first time will find no
:useful meaning in the term 'regular expression' anyway, since it's not

:an English term but a mathematical term. If we must use a mathematical


:term, we should use it correctly.

I'll make a deal: you get everyone to first rename their grep programs to
gspp, and I'll change the perl docs. Remember that grep allows \1 stuff
in its regular expressions.

Don't we have anything more important to argue about?

--tom
--
Tom Christiansen tch...@jhereg.perl.com

"Yes, you can think of almost everything as a function, but this may upset
your wife." --Larry Wall

Rahul Dhesi

unread,
Mar 22, 1997, 3:00:00 AM3/22/97
to

In <ombu8df...@tees.cs.ualberta.ca> Vladimir Alexiev
<vlad...@cs.ualberta.ca> writes:

>Consider this: DFA
>(deterministic automata) and NFA (non-deterministic ones) have the same
>expressive power, namely that of regular languages.

Therefore to decide whether a perl pattern is a regular expression we
need only ask if we can create either a DFA or an NFA to implement the
search.

In general we cannot implement the search for most perl patterns
with DFAs or NFAs because most of them require backtracking.

Because to implement backtracking you must augment the DFA or NFA with
memory, and once you add memory, it's no longer a DFA or NFA. A DFA or
NFA has no memory, other than knowing what state it is in. It does not
remember how it got to that state, so it cannot backtrack.

Vladimir Alexiev

unread,
Mar 22, 1997, 3:00:00 AM3/22/97
to

In article <5gvvha$f...@samba.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

> In general we cannot implement the search for most perl patterns
> with DFAs or NFAs because most of them require backtracking.

Backtracking is an implementation feature of the perl pattern matcher (ok, of
any reasonable pattern matcher) that has nothingnto do with regularity.

> Because to implement backtracking you must augment the DFA or NFA with
> memory, and once you add memory

I don't know the precise sense in which you use "backtracking" above, but in
eg Prolog it is used to implement nondeterminism (to return all solutions to a
problem). You could find all solutions to a problem with a device that's
capable of cloning itself at every choice point. Which a NFA precisely is.

> It does not remember how it got to that state, so it cannot backtrack.

Nor does it need to. Backtracking is a hack used to model nondeterminism (in a
sense, parallelism) on a sequential computer. You don't need backtracking if
you have NFA in your disposal, and I think you agreed that NFAs are still
regular. (BTW they don't have a higher expressive power than DFA for a "wrong
reason": the simulation of a NFA by a DFA explodes the state space
exponentially, but the exponentiation of a finite number is still finite. This
becomes a concern when you care about the "complexity" of an algorithm, eg
NP-complete problems [the epitome of "exponentially hard" problems] can be
solved in polytime on a NTM; in fact that is the definition of NP.)

> >>> "please ignore Dhesi" -- Mark Crispin <m...@CAC.Washington.EDU> <<<

In this case I fully agree with Mark.

Rahul Dhesi

unread,
Mar 23, 1997, 3:00:00 AM3/23/97
to

In <omvi6kv...@tees.cs.ualberta.ca> Vladimir Alexiev
<vlad...@cs.ualberta.ca> writes:

>You don't need backtracking if
>you have NFA in your disposal, and I think you agreed that NFAs are still
>regular.

Are you claiming that every perl search can be implemented as an NFA?
If not, then I fail to see the relevance of the above argument.


--
Rahul Dhesi <dh...@spams.r.us.com>
a2i communications, a quality ISP with sophisticated anti-junkmail features
** message body scan immune to fake headers *** see http://www.rahul.net/

Jeffrey

unread,
Mar 23, 1997, 3:00:00 AM3/23/97
to

Clay Irving <cl...@panix.com> wrote:
|> >with history. I suppose you could start a crusade to use a different
|> >phrase. How about 'whachamacallit'? :-)
|>
|> And I highly recommend the second edition of Jeffrey's Book called,
|> "Mastering Whachamacallits"

Mmmm, somehow this reminds me of the last line of the first paragraph
on page 71 (of the first edition :-)

Jeffrey
----------------------------------------------------------------------------
Jeffrey Friedl <jfr...@ora.com>

O'Reilly & Associates' _Mastering Regular Expressions_
http://enterprise.ic.gc.ca/~jfriedl/regex/

Jeffrey

unread,
Mar 23, 1997, 3:00:00 AM3/23/97
to Rahul Dhesi

[mail and post and cc]

Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
|> I got email from Jeffrey Friedl and I'm not sure if it was private email
|> or a copy of a posting. In any case, based on his comment's here's a
|> followup posting.

I wish you would have posted my original note, since your conclusions were
pretty unrelated to what I wrote. ["private email" refers to the method of
sending, not the contents -- like with all other means of human
communication, I rely only on your sense of right and wrong to not divulge
that which I intend to be private between you and me -- obviously, that
doesn't apply here, so you should have felt free to post it if you weren't
sure if it had been. This is a mail and post, though, so need here]

|> /^a.*b.*c/ # greedy, needs backtracking, slow, not regular expression
|> /^a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression
|>

|> Now the second one needs no backtracking:
|>
|> find an 'a' or fail;

The "or fail" is, by definition, backtracking. As I told you in my earlier
message, both will backtrack. There are situations where the 2nd will
backtrack more than the 1st. If there is no match, they'll both backtrack
exactly the same number of times (since they both match the same possible
sets of strings, but in different orders, they both must test all
permutations before a non-match is reported, and the order doesn't
influence the number).

|> I am assuming that one-character lookahead is not considered to be
|> backtracking.

That's a silly assumption, but also irrelevant to the problem at hand. The
``lookahead'' for the first .*?, for example, is /b.*?c/, not simply one
character.

|> If one-character lookahead is backtracking, then here are better
|> examples:
|>

|> /^a[^b]*b[^c]*c/ # non-greedy, no backtracking, fast, regular expression

This indeed will match faster, in general, but still requires backtracking.
Any match will, by definition, have entailed 2 backtracks.

|> The important point in all this is that some patterns are regular
|> expressions and others are not, and calling all of them regular
|> expressions is not a good idea.

I don't know what this comment has to do with the rest of your post (is it
supposed to be related?), but to a preponderance of people that *use* these
things, ``regular expression'' has just as much implied meaning as ``bob'',
so it's not a hindrance. Sure, it's not what the algebraic term ``regular
expression'' means, but so what? I guess if it hits your hot button you'd
care. I don't happen to think it's a big deal, but don't deny you your right
to think it is :-)

|> The term 'search pattern' or just
|> 'pattern' is nicely generic and does not cause any confusion.

Indeed.

|> If we must use a mathematical term, we should use it correctly.

You mean like ``addition'' and the like? The computation meaning of
these terms is SUBSTANTIALLY different from the real meaning -- in
real life, mathematical ops don't lose precision, overflow, etc.
Seems to be in the same league to me.

Vladimir Alexiev <vlad...@cs.ualberta.ca> followed up:


|> > /^a[^b]*?b[^c]*?c/ # non-greedy, no backtracking, fast, regular expression
|> This one won't need backtracking indeed.

Sure it does... when the [^b] fails, it can backtrack to the ``gee, because
of the * I didn't really need to attempt [^b], so let's skip it'' state.

|> How a matcher is implemented (backtracing etc) has no bearing on the
|> langauge it recognizes (unless it's implemented incorrectly :-).

|> Consider this: DFA (deterministic automata) and NFA (non-deterministic
|> ones) have the same expressive power

In theory (and some practice), an NFA and DFA can be made to match the
exact same language, but in common practice, an ``NFA'' is usually endowed
with a much more expressive language.

To a DFA (awk, lex, etc.), /a|and/ and /and|a/ and /a(nd?)/ are *exactly*
the same. (Internally, they boil down to the same thing.) Yet they're
quite different with Perl and the majority of regex engines out there.
(Quiz: the 3rd will match the same as which of the first two?)

|> Regular languages are ones that can be produced using regular operations:
|> empty, letter, concatenation, alternative, and star. The greedy/non-greedy
|> distinction doesn't come into play here at all.

Oh, yes it does! (unless you're talking only calculus theory and not
considering common practice). If they're the same, then so are /.*/ and
/.*?/, and if the target string has any [non-newline] characters they're
guaranteed to be different.

Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
|> Because to implement backtracking you must augment the DFA or NFA with

|> memory, and once you add memory, it's no longer a DFA or NFA.

``memory'' is simply a method of implementation. If you implemented a true
NFA and made the right decision at each step, you'd need no memory. Pretty
similar to a maze. A DFA is an NFA with the right decision guaranteed in
practice, much like a maze with all wrong-paths closed off.

|> >You don't need backtracking if you have NFA in your disposal, and I think
|> >you agreed that NFAs are still regular.
|>
|> Are you claiming that every perl search can be implemented as an NFA?
|> If not, then I fail to see the relevance of the above argument.

I think there are a lot of problems with semantics here (which is, I believe,
the whole reason you brought up this thread :-). In calculus theory,
NFAs and DFAs provide different ways to view the same thing. In computing
*practice*, the terms ``NFA'' and ``DFA'' are not necessiarly strongly
related to their mathematical background. Perhaps ``NFA style'' etc., would
be better.

You're certainly right that many Perl regex constructs can not be
represented with a (calculoulsly-pure) NFA (or DFA, or anything else
regular, since Perl regular expressions aren't regular -- as you well know :-).

But then, that's OK with me -- Perl regular expressions are not calculus,
they're a programming language within the Perl programming language.

Tom C jumped in with:


|> Don't we have anything more important to argue about?

Like how to use the term ``email'' in English? /-:

Rahul Dhesi

unread,
Mar 24, 1997, 3:00:00 AM3/24/97
to

> /^a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression

To see why /^a.*?b.*?c/ does not require backtracking, implement it as a
finite state machine, as described below. Begin in state 'start'. The
state machine scans its input and makes a state transition on each input
character. Once we have seen an input character and made a state
transition we never need to examine that character again. Hence no
backtracking. EOF represents the end of the input stream:

start: if input is 'b' go to state 'sawb' else if input is EOF
go to state 'fail' else go to (aka remain in) state 'start'.
sawb: if input is 'c' go to state 'done' else if input is EOF
go to state 'fail' else go to state 'sawb'.
done: declare a match.
fail: declare failure.

Note that 'does not require backtracking' means 'at least one
implementation does not require backtracking'.

Bennett Todd

unread,
Mar 24, 1997, 3:00:00 AM3/24/97
to

On 24 Mar 1997 18:23:04 GMT, Rahul Dhesi <c.c....@33.usenet.us.com> wrote:
>To see why /^a.*?b.*?c/ does not require backtracking, implement it as a
>finite state machine, as described below.

> start: if input is 'b' go to state 'sawb' else if input is EOF


> go to state 'fail' else go to (aka remain in) state 'start'.
> sawb: if input is 'c' go to state 'done' else if input is EOF
> go to state 'fail' else go to state 'sawb'.
> done: declare a match.
> fail: declare failure.

Unless I'm missing something, you need another state; perhaps rename "start"
to "sawa" and add

start: if input is 'a' go to state 'sawa' else if input is EOF


go to state 'fail' else go to (aka remain in) state 'start'.

-Bennett

Vladimir Alexiev

unread,
Mar 24, 1997, 3:00:00 AM3/24/97
to

In article <5h2sgm$p...@bug.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

> >You don't need backtracking if you have NFA in your disposal, and I think
> >you agreed that NFAs are still regular.

> Are you claiming that every perl search can be implemented as an NFA?

No. But YOU claimed that every language that requires backtracking (in a
certain IMPLEMENTATION you seem to have in mind) is not regular, which is not
true. And you gave several examples, which all fail to be non-regular.

> If not, then I fail to see the relevance of the above argument.

Perl regexps do not specify regular languages. But not for the reasons you
believe them to.

Vladimir Alexiev

unread,
Mar 24, 1997, 3:00:00 AM3/24/97
to

In article <5h6gq8$g...@samba.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

>
> > /^a.*?b.*?c/ # non-greedy, no backtracking, fast, regular expression
>

> To see why /^a.*?b.*?c/ does not require backtracking, implement it as a
> finite state machine, as described below.

And exactly the same FA works for the greedy ^a.*b.*c

Rahul Dhesi

unread,
Mar 25, 1997, 3:00:00 AM3/25/97
to

In <slrn5jdqd...@onyx.interactive.net> b...@nospam.interactive.net
(Bennett Todd) writes:

>Unless I'm missing something, you need another state; perhaps rename "start"
>to "sawa" and add

> start: if input is 'a' go to state 'sawa' else if input is EOF
> go to state 'fail' else go to (aka remain in) state 'start'.

Yes, you are right. The whole thing, revised, is below.

===
Finite state machine to implement /^a.*?b.*?c/ .

Begin in state 'start'. The state machine scans its input and makes a
state transition on each input character. Once we have seen an input
character and made a state transition we never need to examine that
character again. Hence no backtracking. EOF represents the end of the
input stream:

start: if input is 'a' go to state 'sawa' else go to state 'fail'.
sawa: if input is 'b' go to state 'sawb' else if input is EOF
go to state 'fail' else go to (aka remain in) state 'sawa'.


sawb: if input is 'c' go to state 'done' else if input is EOF
go to state 'fail' else go to state 'sawb'.
done: declare a match.
fail: declare failure.

Rahul Dhesi

unread,
Mar 25, 1997, 3:00:00 AM3/25/97
to

Here is a composite response to many claims that have been made in
this subject thread.

CLAIM: /^a.*b.*c/ (greedy) and /^a.*?b.*?c/ (nongreedy) both match the
same strings.

RESPONSE: Any string that in its entirety is matched by /^a.*b.*c/ is
also matched by /^a.*?b.*c/ . But what if the input stream is not
entirely matched by either? Then, in normal perl code operation, we may
end up matching different substrings of the input stream by the two
patterns. Thus it's incorrect to say that both match the same strings
unless you carefully define just what you mean by 'match' and 'the same
strings'.

COMMENT: Perl and similar pattern matching has one of two goals:
(a) either deciding whether an entire input string is matched by a given
pattern, or (b) searching the input string for some substring that is
matched by a given pattern. Language parsers, on the other hand, care
not much about whether or not we can find a substring that can be parsed
using the rules of the language -- we generally want to parse the entire
input stream. If we find a portion that does not fit our grammar we
declare a parse error.

Therefore blindly extending the concepts of language grammars to pattern
matching will lead to much confusion.

CLAIM: /^a.*?b.*?c/ requires backtracking.
RESPONSE: See finite state machine in another posting.

CLAIM: In /^a.*?b.*?c/ the ``lookahead'' for the first .*? is /b.*?c/,
not simply one character.
RESPONSE: See finite state machine in another posting.

CLAIM: /^a.*?b.*?c/ (nongreedy) and /^a.*b.*c/ (greedy) can be
implemented with exactly the same finite state machine.

RESPONSE: The greedy /^a.*b.*c/ cannot be implemented with a finite
state machine. The first time that we see a 'b' we cannot know if this
is the 'b' we want. Want want the longest possible match for the .*
preceding the 'b'. So we must keep scanning for more occurrences of 'b'
But if we do and find none, then we have lost all the input symbols so
far and cannot make any more decisions based on them. A finite state
machine has no memory of the symbols it has already seen.

CLAIM: /^a.*?b.*?c/ requires backtracking but /^a[^b]*?b[^c]*?c/ does not.
RESPONSE: There is no difference -- the simplest finite state machines
that implement both are identical.

CLAIM: Anything that is generated by a regular language is a regular
expression.
RESPONSE: Only if you remain within the spirit of regular languages and
ask only this question: Whether or not the entire input stream is a
string that belongs to the regular language. As soon as you start
looking for any *substring* in the input stream that is in the regular
language, you are misusing the terminology.

CLAIMS: "How a matcher is implemented (backtracking etc) has no bearing
on the language it recognizes" and "Backtracking is an implementation


feature of the perl pattern matcher (ok, of any reasonable pattern

matcher) that has nothing to do with regularity."

RESPONSE: Certain pattern matches can be implemented without
backtracking and certain others cannot. You can always be sloppy and do
backtracking where it's not needed. But if something absolutely
requires backtracking, then you can't do it without backtracking.

NOW WE CAN SAY IT:

A regular expression is a pattern one for which we can declare a match
or no match, while scanning an input string, without having to rescan
the input string. If we must ever rescan the input string, then the
pattern we are trying to match is not a regular expression.

Almost no perl patterns are regular expressions.

K. Healy

unread,
Mar 25, 1997, 3:00:00 AM3/25/97
to

Jeffrey wrote:
>
> Clay Irving <cl...@panix.com> wrote:
> |> >with history. I suppose you could start a crusade to use a different
> |> >phrase. How about 'whachamacallit'? :-)
> |>
> |> And I highly recommend the second edition of Jeffrey's Book called,
> |> "Mastering Whachamacallits"
>
> Mmmm, somehow this reminds me of the last line of the first paragraph
> on page 71 (of the first edition :-)
^^^^^
Huh? Is there a second already?

Kevin
<kev...@ns.net>

Vladimir Alexiev

unread,
Mar 25, 1997, 3:00:00 AM3/25/97
to

> |> Now the second one needs no backtracking:
> |> find an 'a' or fail;
> The "or fail" is, by definition, backtracking.

Why do you say so? Eg /a/ will fail on "b", and I don't see how it could
possibly backtrack.

> |> > /^a[^b]*?b[^c]*?c/

> |> This one won't need backtracking indeed.
> Sure it does... when the [^b] fails, it can backtrack to the ``gee, because
> of the * I didn't really need to attempt [^b]

When would the [^b] fail? I read teh above thus: check that the first char is
a, then gobble up all ^b's, check that there is a next char (it would be b),
gobble up all ^c's, check for a c.

> in common practice, an ``NFA'' is usually endowed with a much more
> expressive language.

What do you mean by "more expressive in common practice"? If exactly the same
languages (regular ones) are recognized by both NFAs and DFAs, which is more
expressive? You can't make a distinction unless you also consider execution
time etc.

> To a DFA (awk, lex, etc.), /a|and/ and /and|a/ and /a(nd?)/ are *exactly*
> the same.

Ok (though you seem to have a typo in the last one).

> Yet they're quite different with Perl and the majority of regex engines out
> there.

From an implementation point of view? So what? We were talking about the math
term "regular language" and whether perl regexps implement those or more or
less. The first two regexps specify the same language (set of strings) in
either formal langauges or perl (unless the perl implementation is wrong).

> |> The greedy/non-greedy |> distinction doesn't come into play here at all.

> If they're the same, then so are /.*/ and /.*?/, and if the target string
> has any [non-newline] characters they're guaranteed to be different.

They match THE SAME SET OF STRINGS, though they pick up the match differently.
Both specify a regular language.


Vladimir Alexiev

unread,
Mar 25, 1997, 3:00:00 AM3/25/97
to

In article <5h8i9o$f...@samba.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

> CLAIM: /^a.*b.*c/ (greedy) and /^a.*?b.*?c/ (nongreedy) both match the
>

> RESPONSE: what if the input stream is not entirely matched by either? Then,


> in normal perl code operation, we may end up matching different substrings
> of the input stream by the two patterns. Thus it's incorrect to say that
> both match the same strings

It's quite correct to say so: /$re1/ and /$re2/ will return 1 for the same set
of values of $_.

> COMMENT: Perl and similar pattern matching has one of two goals:
> (a) either deciding whether an entire input string is matched by a given
> pattern,

Right. But regular languages only have this goal.

> or (b) searching the input string for some substring that is
> matched by a given pattern.

If you claim that something does not specify a regular language, and if you
use the term in its mathematical sense, you have to stick with (a).

> Therefore blindly extending the concepts of language grammars to pattern
> matching will lead to much confusion.

No, blindly using the term "regular expression" in its math sense without
conforming to what Chomsky and Kleene used it for is what leads to confusion.

> CLAIM: /^a.*?b.*?c/ (nongreedy) and /^a.*b.*c/ (greedy) can be
> implemented with exactly the same finite state machine.
>
> RESPONSE:

Before I consider your response, let me reiterate my argument. Both the above
regexps require a string that starts with a and has a prefix also containing b
and c. Thus your "sawa, sawb" FA works for the second regexp too. It is true
that it would match the "wrong" c (not the last possible one), but this
doesn't matter. Because FA's *cannot specify matches*, it can only say whether
the string matched or not. At least the output-less FAs that are used in
regular language thory.

BTW there's a moot point in the definition of greedy *. How exactly should
this be maximized? Does the first .* take precedence over the second .*? Eg
for the string abbc, is
a .* b .* c a better match than a .* b .* c ?
a b b c a b b c

> The greedy /^a.*b.*c/ cannot be implemented with a finite
> state machine.

Well, how do you expect the FA to report the matches? If you want it to say "I
matched the b from the regexp" as soon as it's seen the correct b, then
obviously it can't do that, because this depends on whether there are more b's
(and c's0 in the input.

> CLAIM: Anything that is generated by a regular language is a regular
> expression.
> RESPONSE: Only if you remain within the spirit of regular languages

But certainly you should remain within this spirit. Otherwise what's the THIRD
sense (neither formal languages nor perl) in which you're using the term?

> As soon as you start looking for any *substring* in the input stream that is
> in the regular language, you are misusing the terminology.

Right ;-) Formal regular languages are not concerned with the way something
matches, only with whether it matches.

> CLAIMS: "How a matcher is implemented (backtracking etc) has no bearing
> on the language it recognizes"

Now an example showing that non-greedy matching may require backtracking:
"abbc" =~ /a.*?b[^b]c/ The .*? will first match "" but then it will have to
reconsider and match the first b.

Rahul Dhesi

unread,
Mar 26, 1997, 3:00:00 AM3/26/97
to

In <5guvee$ikd$1...@csnews.cs.colorado.edu> Tom Christiansen
<tch...@mox.perl.com> writes:

>I'll make a deal: you get everyone to first rename their grep programs
>to gspp, and I'll change the perl docs. Remember that grep allows \1
>stuff in its regular expressions.

But gspp means nothing to newcomers, and neither does grep, so both are
arbitrary terms, and grep is at least easily pronounced. I see no harm
in short arbitrary names.

'Search pattern' has the correct intuitive meaning for new perl users.
'Regular expression' asks the nonmathematical reader to believe that we
are talking about a special type of expression, one that is "regular" in
some sense, which is not really true. And 'regular expression' invites
the reader with a mathematical background to be misled into thinking
that such an expression might be implemented with a finite state
machine, which is generally not true either.

So I propose the following to all who read this:

In the future, whenever you have a choice, say or write 'pattern'
or 'search pattern' or 'perl pattern' as appropriate instead of
'regular expression'. It will be easier for your nontechnical
audience to understand what you mean, it will mislead fewer people,
and you even have to type fewer characters!

It looks like a win-win situation to me.

Jeff Stampes

unread,
Mar 27, 1997, 3:00:00 AM3/27/97
to

Tom Christiansen (tch...@mox.perl.com) wrote:
: Don't we have anything more important to argue about?

Gee, you mean like the variations in spelling and use surrounding the
word 'e-mail'?

--
Jeff Stampes -- Xilinx, Inc. -- Boulder, CO -- jeff.s...@xilinx.com

Rahul Dhesi

unread,
Mar 28, 1997, 3:00:00 AM3/28/97
to

In <omohc7v...@tees.cs.ualberta.ca> Vladimir Alexiev
<vlad...@cs.ualberta.ca> writes:

>Now an example showing that non-greedy matching may require backtracking:
>"abbc" =~ /a.*?b[^b]c/ The .*? will first match "" but then it will have to
>reconsider and match the first b.

Note that "abbc" will never match /a.*?b[^b]c/.

But anyway:

If you make this anchored, then backtracking is not required. Let's
implement a state machine for /^a.*?b[^b]c/ . An implicit fail on EOF
is assumed unless we have already declard success.

start: if 'a' goto dotstar else fail.
dotstar: if 'b' goto sawb else remain in dotstar.
sawb: if 'b' goto sawb else goto wantc.
wantc: if c declare success else if b goto sawb else goto dotstar.

Vladimir Alexiev

unread,
Mar 31, 1997, 3:00:00 AM3/31/97
to c.c....@33.usenet.us.com

In article <5hhc4p$s...@samba.rahul.net> c.c....@33.usenet.us.com (Rahul Dhesi) writes:

> >Now an example showing that non-greedy matching may require backtracking:
> >"abbc" =~ /a.*?b[^b]c/

> Note that "abbc" will never match /a.*?b[^b]c/.

Sorry about that. What I meant was /a.*?b[^b]*?c/ (I came up with this through
minimal modification of your original example). Now THIS requires
backtracking. The idea is that a later part of the regexp makes an early empty
match on .*? to fail.

However, I need to qualify the above statement, because it *depends on the
implementation*. I should say "needs backtracking in the most obvious
implementation". In fact, below is an implementation that doesn't require
backtracking. It illustrates the methodology of transforming a NFA to a DFA.

Start with the most obvious NFA that implements /a.*?b[^b]*?c/. It can be
generated automatically from the regexp by composing single-state FAs.
(I assume the alphabet is just a,b,c.)
input a b c
state 0 1 - - The table entries are the new states.
1 1 1,2 1 1,2 and 2,3 designate nondeterminism (either state).
2 2 - 2,3

Now we have to complete this table with the new pseudostates 1,2 and 2,3:
input a b c
state 1,2 1,2 1,2 1,2,3
2,3 2 - 2,3
1,2,3 1,2 1,2 1,2
Yet another pesudostate 1,2,3 appears, and we complete with it too.

Now start again from 0 and see which states are reachable. Some of the new
states subsume old states and thus make them useless.
input a b c
state 0 1 - -
1 1 1,2 1
1,2 1,2 1,2 1,2,3
1,2,3 1,2 1,2 1,2
The final state is 1,2,3 because it contains the original final state 3.

Or if we give some more symbolic names to our states:
input a b c
state start sawa - -
sawa sawa sawb sawa
sawb sawb sawb sawc
sawc sawb sawb sawb
Where sawc is the final success state (there's transitions going "back" from
it, but that's legit).

The important point here is that we *can't talk* about the non-greediness
condition here: none of the states corresponds precisely to the .*? part in
the original regexp, so the DFA can't report when it has matched it. Indeed,
for the string "aabc" the .*? matches "a" and ends up in the sawa state, while
for "abbc" it matches "b" in the sawb state.

In fact it's impossible to build a DFA where a specific state corresponds to
.*? Consider the regexp again: /a.*?b[^b]*?c/ matches any string containing
a,b,c (in that order), and the b in the regexp is forced to match the last b
of the string (there can be no b's after it). If the state q of a DFA would
correspond to the .*?, then a b transition from it would lead us to a state r
that would recognize the last b in the string. But the DFA can't know it has
reached the last b before it has seen all of the string. Therefore there
cannot be any transitions out of r, but there must be a c transition out of
it.

Reply all
Reply to author
Forward
0 new messages