I have been wondering, with limited success, how to rewrite a regexp
without word boundaries.
According to http://www.perl.com/doc/manual/html/pod/perlre.html:
<from Perl manual>
In addition, Perl defines the following:
\w Match a "word" character (alphanumeric plus "_")
\W Match a non-word character
[...]
Perl defines the following zero-width assertions:
\b Match a word boundary
[...]
A word boundary (\b) is defined as a spot between two characters
that has a \w on one side of it and a \W on the other side of it (in
either order), counting the imaginary characters off the beginning and
end of the string as matching a \W.
</from Perl manual>
Thus, regexp "ex" matches "example" and "text", whereas regexp "\bex"
matches "example" but not "text".
Now, if "\b" occurs at the beginning (or end) of the regexp, I think
it's easy to rewrite the regexp without using "\b". For example,
"\bex" could be rewritten as "\Wex".
But what if "\b" occurs within the regexp? For example, how to get rid
of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
equivalent to "\W\bex".
So, is there an algorithm that transforms any regexp into an
equivalent regexp that contains no "\b"?
NOTE: In this message, by "regexp" I mean regular expressions in their
traditional meaning (= empty set, empty string, single symbol, union,
concatenation, Kleene star, and nothing else).
-- dave
> I have been wondering, with limited success, how to rewrite a regexp
> without word boundaries.
Why do you want to? Most likely, the answer is that your regexps are
getting too clever and thus too unreadable/bug-prone, so you should
break them up and use more ordinary programming instead.
However:
> (...)
> Thus, regexp "ex" matches "example" and "text", whereas regexp "\bex"
> matches "example" but not "text".
>
> Now, if "\b" occurs at the beginning (or end) of the regexp, I think
> it's easy to rewrite the regexp without using "\b". For example,
> "\bex" could be rewritten as "\Wex".
No, that would not match "ex" at the beginning of the string being
matched. You could use (?:^|\W)ex. But the matched substring (Perl's
$&) will differ from with \bex, so it depends on how the regexp is used.
> But what if "\b" occurs within the regexp? For example, how to get rid
> of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
> wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
> equivalent to "\W\bex".
Rewrite <RE> to a regexp which ends with \W or \W|^ and is equivalent
in the cases where it is followed by \b\w.
--
Hallvard
The regexps are not mine... Sorry, I should have explained. I am
actually writing a tool that takes regexps as input and transforms
them internally into NFAs/DFAs. Since the regexps are not really in my
hands, I should be ready for weird regexps - for example, regexps with
"\b" preceded or followed by other regexps. And I don't know how to
transform a regexp that contains "\b" at an arbitrary position into an
equivalent NFA/DFA.
> > Now, if "\b" occurs at the beginning (or end) of the regexp, I think
> > it's easy to rewrite the regexp without using "\b". For example,
> > "\bex" could be rewritten as "\Wex".
>
> No, that would not match "ex" at the beginning of the string being
> matched. You could use (?:^|\W)ex.
You are right. My mistake.
> But the matched substring (Perl's
> $&) will differ from with \bex, so it depends on how the regexp is used.
I am not interested in the matched substring, so "(?:^|\W)ex" (or
rather the NFA/DFA corresponding to "(?:^|\W)ex") is fine... in this
particular case.
> > But what if "\b" occurs within the regexp? For example, how to get rid
> > of "\b" in "<RE>\bex" (with "<RE>" being any regexp)? "<RE>\Wex"
> > wouldn't work here: for example (with "<RE>" = "\W"), "\W\Wex" is not
> > equivalent to "\W\bex".
>
> Rewrite <RE> to a regexp which ends with \W or \W|^ and is equivalent
> in the cases where it is followed by \b\w.
Hmm, not so simple in the general case, but I will have to think about
this possibility.
-- dave
> ...I am
> actually writing a tool that takes regexps as input and transforms
> them internally into NFAs/DFAs. ... I don't know how to
> transform a regexp that contains "\b" at an arbitrary position into an
> equivalent NFA/DFA.
I don't think that is possible: that is an additional mechanism on top
of the reg-ex structure.
Formally, let C be the character set, and C* the set of strings (the
free monoid on the set C). Then a language L is a subset of C*. The
regular expressions are operations on subsets of C* that define
languages. If L is a regular expression language, then a string either
matches - it is a subset of L - or it does not.
Now, when used in a regular expression matcher program, one selects a
parsing point in the string and looks at the set of strings of
successive lengths from that point. Usually, the match is the longest of
those strings that are in L. But there is a choice, which does not
depend on L.
So I think you need to implement a mechanism without the NFA/DFA
structure that identifies the \b boundaries.
Hans
It's been way too long since I played with NFAs/DFAs, but anyway:
Since you are saying NFA, maybe this a somewhat theoretical exercise
so it's OK to produce one with a godawful number of states. Is the
following possible?
Make a \b-tracking DFA, one which only notices word boundaries.
Make an NFA for your regexp without '\b's, but rewrite it so any jump
from a \b in the regexp is the only jump into the destination state.
Join the two machines (combining all possible states from the two) so
they in effect run in parallel in a big machine. Give \b as additional
input from the 1st submachine to the "coming from \b" states in the 2nd.
--
Hallvard
Why don't you study Perl's regex engine to see how they implement it?
It is all open source. I did this a while ago. It is very
interesting to look at, Perl has the most advanced and heavily used
regex engine out of just about anything.
Also one thing to note is that in formal language theory "regular
expression" has a well-defined meaning. See Chomsky. Perl's regular
expressions do *not* classify as regular expressions under the formal
definition.
-Andrew.
[Perl's regex engine is swell, but if performance is an issue it's nowhere
near as fast as a DFA generated by flex or re2c. -John]
However, if I were attacking this problem, and I might have reason to
over the next year (so I would be glad to correspond further on it), I
would do some case-analysis to prune off easy cases. So, we'll call
our original RE A\bB, such that A & B are both regular expressions.
If your analysis of A or B can tell you what characters A can end with
or B can begin with (and if you are making an FA, you should be able
to get that information), since there has to be a set of states that
correspond to where the \b is to be matched. The edges into and out of
that state will be labelled with the valid characters upon which the
transition can occur. If the ones leading in have characters in the
\w set, the ones leading out must have characters in the \W set and
vice-versa. Note, if one or the other side has characters from both
sets \w and \W, split the state into 2 states, with the \W characters
leading to 1 state and the \w characters leading to the other. I'll
leave the rest of this idea as an exercise to the reader. I'm sure
there are some thorny details that I haven't worked out.
Hope this helps,
-Chris
******************************************************************************
Chris Clark Internet: christoph...@compiler-resources.com
Compiler Resources, Inc. or: com...@world.std.com
23 Bailey Rd Web Site: http://world.std.com/~compres
Berlin, MA 01503 voice: (508) 435-5016
USA fax: (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
Andrew Tomazos <and...@tomazos.com> wrote:
>[...]
>Also one thing to note is that in formal language theory "regular
>expression" has a well-defined meaning. See Chomsky. Perl's regular
>expressions do *not* classify as regular expressions under the formal
>definition.
> -Andrew.
>[Perl's regex engine is swell, but if performance is an issue it's nowhere
>near as fast as a DFA generated by flex or re2c. -John]
IIRC perl's regex implementation (as well as glibc, btw) even lose
compared to *NFA emulation*, for example with a?^na^n, matched against
a^n (where "^n" is manually expanded to mean n times, i.e. with n=3
it'd be the regular expression a?a?a?aaa against string aaa). Java
loses, too (been there, tested it). OpenBSD's libc doesn't (seems to
be plain vanilla NFA emulation when it sees the expression is in fact
regular; yields the expected, roughly quadratic runtime behaviour).
See http://swtch.com/~rsc/regexp/regexp1.html for information about that
problem.
I also tested the same thing with flex, using a small script that
generates flex input and a test program for a given n. Horrendous
turnaround times (flex generation mostly) for increasing n, blazingly
fast match times, both as expected.
Kind regards,
Hannah.
[Spamassassin has an option to compile its patterns with re2c and link that into
perl. It takes the better part of an hour to compile, but it runs really fast. -John]
On my 2Ghz Xeon E5335 server it takes 15 seconds to download an updated
ruleset and compile it using re2c.
Regarding the performance of parsing engines, the PEG formalism is
nice in that it gives the programmer much more straightforward control
over backtracking than perl-style regexes and nuch more predictable
performance, while allowing small and fast implementations (e.g. LPEG).
Tony.
--
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
[I timed it, it's not an hour, on my server it takes about 15 seconds to run re2c to create the C
source, but iover 15 minutes to compile and link that. Still hugely worth it for spamassassin which
runs several times per minute. -John]