
How to get the "longest possible" match with Python's RE module?


Licheng Fang

unread,
Sep 11, 2006, 11:04:58 PM9/11/06
to
Basically, the problem is this:

>>> p = re.compile("do|dolittle")
>>> p.match("dolittle").group()
'do'

Python's NFA regexp engine tries only the first option, and happily
rests on that. Here's another example:

>>> p = re.compile("one(self)?(selfsufficient)?")
>>> p.match("oneselfsufficient").group()
'oneself'

The Python regular expression engine doesn't exhaust all the
possibilities, but in my application I hope to get the longest possible
match, starting from a given point.

Is there a way to do this in Python?

MonkeeSage

unread,
Sep 11, 2006, 11:26:43 PM9/11/06
to
Licheng Fang wrote:
> Basically, the problem is this:
>
> >>> p = re.compile("do|dolittle")
> >>> p.match("dolittle").group()
> 'do'

From what I understand, this isn't python specific, it is the expected
behavior of that pattern in any implementation. You are using
alternation, which means "either, or", and you have the shorter
subexpression first, so the condition is satisfied by just 'do' and the
matching terminates.
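
If the longest alternative is what's wanted and reordering is acceptable,
putting the longer branch first is enough for this particular case (a
minimal sketch, using the OP's example):

>>> import re
>>> p = re.compile("dolittle|do")
>>> p.match("dolittle").group()
'dolittle'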

> There's another example:
>
> >>> p = re.compile("one(self)?(selfsufficient)?")
> >>> p.match("oneselfsufficient").group()
> 'oneself'

Again, I don't think this has anything to do with python. Your pattern
basically means "match 'one' whether it is followed by 'self' or not,
and whether it is followed by 'selfsufficient' or not". For this
particular example, you'd want something like
"one(self)?(sufficient)?".

I think you could construct a pattern that would do what you want in
python without any problem. If you post a (short) example of your data,
I'm sure someone could help you with it.

Regards,
Jordan

kondal

unread,
Sep 12, 2006, 12:05:46 AM9/12/06
to

This is the way the regexp engine works; Python doesn't have anything to
do with it. It starts parsing the data with the pattern given, returns the
matched string according to the pattern, and doesn't go back to find the
other combinations.

So to get all the combinations you would probably have to give a
different pattern each time.

Licheng Fang

unread,
Sep 12, 2006, 12:51:03 AM9/12/06
to

Hi, according to these regexp engine discussions, it's NOT the behavior
of every implementation.

http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html
http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/RegularExpressionEngines.html

Python's NFA engine reads along the input string, matching it to the
pattern, and backtracking when needed. By contrast a DFA engine, to my
understanding, constructs a DFA and uses it to munch as many characters
as possible. Maybe it's like this:

Pattern: one(self)?(selfsufficient)?

PYTHON'S NFA ENGINE:

          one            self, none             selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

DFA ENGINE:

          one               self
(start)------->((123))------------>((23))
                  |
                  |
                  | selfsufficient
                  --------------->((3))

I want to know if there is some way to make Python RE behave like grep
does, or do I have to change to another engine?

Paddy

unread,
Sep 12, 2006, 1:01:23 AM9/12/06
to

>
> I want to know if there is some way to make Python RE behave like grep
> does, or do I have to change to another engine?

Maybe if you posted a (tested) grep example and data, that does as you
want, the group could better understand what you are asking for?

- Paddy.

Tim Peters

unread,
Sep 12, 2006, 1:16:27 AM9/12/06
to pytho...@python.org
[Licheng Fang]
> ...

> I want to know if there is some way to make Python RE behave like grep
> does,

Not in general, no. The matching strategies couldn't be more
different, and that's both deep and intentional. See Friedl's book
for details:

http://regex.info/

> or do I have to change to another engine?

Yes, if POSIX regexp semantics are what you require. Several years
ago there was an extension module for Python supplying POSIX
semantics, but I couldn't find anything current just now in a minute
of searching. You may be more motivated to search longer ;-)

MonkeeSage

unread,
Sep 12, 2006, 1:19:05 AM9/12/06
to
Licheng Fang wrote:
> Hi, according to these regexp engine discussions, it's NOT a behavior
> true to any implementation.
> [snip]

Well, I just double-checked in ruby (oniguruma regexp engine):

r = Regexp.new("do|dolittle")
puts r.match("dolittle")[0]
# do

r = Regexp.new("one(self)?(sufficient)?")
puts r.match("oneselfsufficient")[0]
# oneself

And perl:

if ("doolittle" =~
/(do|dolittle)/) {
print "$1\n";
# do
}

if ("oneselfsufficient" =~
/(one(self)?(selfsufficient)?)/) {
print "$1\n";
# oneself
}

And Javascript (whatever regexp engine Spidermonkey uses):

var r = new RegExp(/do|dolittle/);
alert("dolittle".match(r)[0]);

var r = new RegExp(/one(self)?(selfsufficient)?/);
alert("oneselfsufficient".match(r)[0]);

So, it seems they are all broken, or python is correct as well.

Regards,
Jordan

MonkeeSage

unread,
Sep 12, 2006, 1:50:45 AM9/12/06
to
MonkeeSage wrote:
> So, it seems they are all broken, or python is correct as well.

Aha, sorry about that Licheng (re: Tim's post). I guess "broken"
depends on whether you are expecting perl-compatible behavior or not.
I have my own scripts I use to do (f)grep and sed-like operations, so I
almost never use those programs and forgot about the different pattern
semantics (part of the reason I made my own scripts).

Regards,
Jordan

Licheng Fang

unread,
Sep 12, 2006, 1:51:38 AM9/12/06
to
Oh, please do have a look at the second link I've posted. There's a
table comparing the regexp engines. The engines you've tested probably
all use an NFA implementation.

MonkeeSage

unread,
Sep 12, 2006, 2:04:48 AM9/12/06
to
Licheng Fang wrote:
> Oh, please do have a look at the second link I've posted. There's a
> table comparing the regexp engines. The engines you've tested probably
> all use an NFA implementation.

Sorry! *blush* I admit I skipped over your links. I'll have a look now.

BTW, just an idea that may or may not work. What about finding all
matches that meet the absolute baseline pattern, then taking the
longest of them... something like this maybe:

import re

def matcher(strings, pattern):
    out = ''
    reg = re.compile(pattern)
    for string in strings:
        m = reg.match(string)              # guard against None (no match)
        if m is None:
            continue
        match = m.group()
        if (len(match) >= len(out)):       # should this use > or >= ?
            out = match
    return out                             # empty if no matches, else longest match

p = ['dodad', 'dolittle', 'dodaday']
print matcher(p, r'do.*')
# dolittle

Just a thought...

Regards,
Jordan

Licheng Fang

unread,
Sep 12, 2006, 2:18:28 AM9/12/06
to
Thank you very much, Tim and Monkee.

In fact, what I'm doing is handling a lot of regular expressions. I
wanted to build VERY LONG regexps part by part and put them all into a
file for easy modification and maintenance. The idea is like this:

(*INT) = \d+
(*DECIMAL) = (*INT)\.(*INT)
(*FACTION) = (*DECIMAL)/(*DECIMAL)
(*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
... ...

What's inside the syntactically wrong (* and ) is something to be
replaced, and then I wrote a little script to do the string
substitution, to get very long regexps to be compiled. I thought that
might be a way to handle long and complex regexps, and along the way I
encountered the problem with the semantics of '|'.

My approach may sound desperate and funny, but please, do you have any
good idea as to how to handle long and complex regexps?
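
A minimal sketch of the kind of substitution script described above; the
(*NAME) convention is the one from the example, and expand_defs is a
hypothetical helper, not the actual script:

import re

# Definitions in dependency order; each right-hand side may refer to
# earlier names with the (*NAME) placeholder syntax.
DEFS = [
    ("INT",      r"\d+"),
    ("DECIMAL",  r"(*INT)\.(*INT)"),
    ("FACTION",  r"(*DECIMAL)/(*DECIMAL)"),
    ("NUMERALS", r"(*FACTION)|(*DECIMAL)|(*INT)"),
]

def expand_defs(defs):
    expanded = {}
    for name, rhs in defs:
        for ref, repl in expanded.items():
            # Replace each placeholder by its already-expanded definition,
            # wrapped in a non-capturing group to keep precedence intact.
            rhs = rhs.replace("(*%s)" % ref, "(?:%s)" % repl)
        expanded[name] = rhs
    return expanded

REGEXPS = expand_defs(DEFS)
print(REGEXPS["NUMERALS"])          # one long alternation, ready for re.compile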

MonkeeSage

unread,
Sep 12, 2006, 2:18:46 AM9/12/06
to
Or maybe something like this is better:

import re

def matcher(string, pattern):
    out = ''
    for match in re.findall(r'\S*%s\S*' % pattern, string):
        if (len(match) >= len(out)):
            out = match
    return out

p1 = 'dodad donkeykong dolittle dodaday'
p2 = 'oneself self-serving selfsufficient oneselfsufficient'
print matcher(p1, 'do')
# donkeykong
print matcher(p2, 'self')
# oneselfsufficient

Bryan Olson

unread,
Sep 12, 2006, 2:51:51 AM9/12/06
to
Licheng Fang wrote:
> Oh, please do have a look at the second link I've posted. There's a
> table comparing the regexp engines. The engines you've tested probably
> all use an NFA implementation.

Unfortunately, the stuff about NFA's is wrong. Friedl's awful
book was the first time I saw this confusion about what NFA is;
I don't know if he originated the mess or just propagated it.

"Nondeterministic finite automata" is well defined in theory
of computation. The set of languages accepted by NFA's is
exactly the same as the set accepted by DFA's.

What Python uses is search-and-backtrack. Unfortunately such
engines don't have much theory behind them, and it's hard to
reason generally about what they do.


--
--Bryan

Tim Peters

unread,
Sep 12, 2006, 2:52:41 AM9/12/06
to pytho...@python.org
[Licheng Fang]
> ...

My best advice is to find a different approach entirely. For example,
build a parser using parser technology, and use regexps in that /only/
to do gross tokenization ("string of digits", "decimal point", ...):
build the rest with a grammar.

Regexps are a brittle tool, best tolerated in small doses. For an
"NFA" implementation like Python's, you're likely to experience poor
speed when combining many complex regexps, and /especially/ when
alternatives are ambiguous wrt prefixes (and yours are, else you
wouldn't have a problem with "longest match" versus "some shorter
match" to begin with). OTOH, under a "DFA" implementation such as
POSIX grep's, you're likely to experience exponential memory
requirements (DFA implementations can need to build enormous state
machines, tracing out in advance all possible paths through all the
regexps applied to all possible input strings).

Just sounds to me like the wrong tool for the job.
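
To make that concrete, here is a rough sketch of "regexps only for gross
tokenization" over the OP's INT/DECIMAL/FRACTION idea; the token names and
the grammar that would sit on top of them are illustrative guesses, not
Tim's code:

import re

# A single scanner regexp with one group per token class; no alternation
# between overlapping "big" patterns, so no longest-match headaches.
TOKEN = re.compile(r'\s*(?:(\d+)|(\.)|(/)|(.))')

def tokens(text):
    for number, dot, slash, other in TOKEN.findall(text):
        if number:
            yield ('INT', number)
        elif dot:
            yield ('DOT', dot)
        elif slash:
            yield ('SLASH', slash)
        elif other.strip():
            yield ('OTHER', other)

print(list(tokens("3.14/2")))
# [('INT', '3'), ('DOT', '.'), ('INT', '14'), ('SLASH', '/'), ('INT', '2')]

A small grammar-driven parser then decides that INT DOT INT is a decimal,
decimal SLASH decimal is a fraction, and so on, instead of encoding those
decisions in one giant regexp.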

Tim Peters

unread,
Sep 12, 2006, 3:12:51 AM9/12/06
to pytho...@python.org
[Licheng Fang]

>> Oh, please do have a look at the second link I've posted. There's a
>> table comparing the regexp engines. The engines you've tested probably
>> all use an NFA implementation.

[Bryan Olson]


> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
> book

Strongly disagree: it's an excellent book about the /pragmatics/ of
using "regular expressions" as most widely implemented. It's not at
all about "regular expressions" in the CompSci sense of the term,
which appears to be your complaint.

> was the first time I saw this confusion about what NFA is;
> I don't know if he originated the mess or just propagated it.

As far as I could tell at the time, he originated it. I'm not happy
about that either.

> "Nondeterministic finite automata" is well defined in theory
> of computation. The set of languages accepted by NFA's is
> exactly the same as the set accepted by DFA's.

And which has very little to do with "regular expressions" as most
widely implemented -- gimmicks like backreferences are wholly outside
the DFA formalism.

> What Python uses is search-and-backtrack. Unfortunately such
> engines don't have much theory behind them, and it's hard to
> reason generally about what they do.

Yup X 3, and the last is precisely why Friedl's book is valuable for
people using "NFA" implementations: Friedl does a good job of
explaining when and why you're likely to trigger atrocious runtime
performance, and gives practical general techniques for avoiding those
problems. None of that has anything to do with CompSci regexps
either, but is vital knowledge for people hoping to make happy
non-trivial use of Python/Perl/etc regexps.

ga...@dsdata.it

unread,
Sep 12, 2006, 4:36:40 AM9/12/06
to

kondal wrote:

> This is the way the regexp works python doesn't has anything to do with
> it. It starts parsing the data with the pattern given. It returns the
> matched string acording the pattern and doesn't go back to find the
> other combinations.

I've recently had the same problem in Java, using automatically
generated regular expressions to find the longest match; I failed on
cases like matching the whole of "Abcdefg", but also the whole of
"AbCdefg" or "ABcdefg", with ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})? .
No systematic way to deal with these corner cases was available, and
unsystematic ways (with greedy and reluctant quantifiers) were too
complex.
I ended up eliminating regular expressions completely and building a
dynamic programming parser that returns the set of all match lengths;
it wasn't hard and it should be even easier in Python.
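
A rough sketch of what such a dynamic-programming matcher can look like for
exactly that pattern, ([A-Z][a-z])?([A-Z][A-Za-z]{1,10})?; the helper names
are made up, and this only illustrates the idea, not the original Java code:

def opt_upper_lower(text, i):
    # ([A-Z][a-z])? : contributes length 0, or 2 if the two characters fit
    ends = set([i])
    if i + 1 < len(text) and text[i].isupper() and text[i + 1].islower():
        ends.add(i + 2)
    return ends

def opt_upper_word(text, i):
    # ([A-Z][A-Za-z]{1,10})? : length 0, or anything from 2 up to 11
    ends = set([i])
    if i < len(text) and text[i].isupper():
        j = i + 1
        while j < len(text) and text[j].isalpha() and j - i <= 10:
            j += 1
            ends.add(j)
    return ends

def all_match_lengths(text, start=0):
    # Dynamic programming over the sequence of optional pieces: carry the
    # set of positions reachable so far through each piece in turn.
    ends = set([start])
    for piece in (opt_upper_lower, opt_upper_word):
        ends = set(e for pos in ends for e in piece(text, pos))
    return sorted(e - start for e in ends)

print(all_match_lengths("AbCdefg"))   # [0, 2, 3, 4, 5, 6, 7] -- 7 is the full match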

Lorenzo Gatti

Licheng Fang

unread,
Sep 12, 2006, 5:08:50 AM9/12/06
to

Thanks for the valuable information. Indeed, when I read the pages, I
was a little confused about what it meant by 'NFA'.

But I faintly felt there might be an indirect, and not very exact,
mapping from the search-and-backtrack strategy to NFAs in the sense of
computer science, e.g. a state machine with the capability to be in
several states at one time.

Say, when reading 'oneselfsufficient', the engine goes along the NFA
first to state 1, and then faces the choice between

          one            self, none             selfsufficient, none
(start)------->((1))------------>((2))----------------------->((3))

1) matching 'self',
2) going on to state 2 without matching anything, or
3) just give 'one' as the matching result because state 1 is already a
terminal state

In such situations it always chooses the greedy way. To match more, it
goes to state 2, munching 'self'. And now it's left with only
'sufficient'. Here its choices are:

1) matching nothing and going to state 3
2) just give 'oneself' as result because state 2 is also a terminal
state

Again it's greedy, going on to state 3, in hope of matching more. But
here the pattern comes to an end, represented by state 3 as a terminal
state. So the engine gives 'oneself' as the result and forgets about
its previously unexplored possibilities, because it only backtracks
when a match cannot be found.

I think if the backtracking is carried out in an exhaustive way, we may
say the engine tries every possibility on the NFA, though it's not an
NFA itself.

Licheng Fang

unread,
Sep 12, 2006, 5:46:07 AM9/12/06
to

Thanks. I think making use of the expressiveness of CFGs may be a
better idea.

Another question: my task is to find in a given string the substrings
that satisfy a particular pattern. That's why the first tool that came
to my mind was regular expressions. Parsers, however, only give a
yes/no answer for a given string. To find all substrings with a
particular pattern I may have to try every substring, which may be an
impossible task.

How can I solve this problem?

Paul Rubin

unread,
Sep 12, 2006, 5:50:46 AM9/12/06
to
"Licheng Fang" <fangl...@gmail.com> writes:
> I think if the backtracking is carried out in an exhaustive way, we may
> say the engine tries every possibility on the NFA, though it's not an
> NFA itself.

The backtracking engine really can recognize languages that are not
describable by classical regexps, by using backreferences, negation,
etc. But exactly what it can do is nowhere near as well-understood as
what classical regexps can do.

I seem to remember that the fully general problem of recognizing
regexps with negation is very hard, so the backtracking matcher either
has to reject some strings it should really match, or else it has to
bog down horribly with certain kinds of patterns.

ga...@dsdata.it

unread,
Sep 12, 2006, 6:06:05 AM9/12/06
to

Licheng Fang wrote:

> Another question: my task is to find in a given string the substrings
> that satisfies a particular pattern. That's why the first tool that
> came to my mind is regular expression. Parsers, however, only give a
> yes/no answer to a given string. To find all substrings with a
> particular pattern I may have to try every substring, which may be an
> impossible task.

You can collect all successful parser results beginning from each index
in the string; this gives you all matches with that first index.
You could extend this to multiple results with general bottom-up
context-free language parsing like Earley's or Tomita's algorithms; for
reasonable languages most locations can be excluded for most rules at
the beginning, with great performance improvements over trying over and
over again.
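
Sketching that suggestion (match_lengths is any callable that reports the
set of match lengths starting at an index, e.g. the dynamic-programming
sketch in an earlier reply):

def all_spans(text, match_lengths):
    # Run the recognizer from every start index and keep every
    # (start, end) span it reports, skipping empty matches.
    spans = []
    for start in range(len(text) + 1):
        for length in match_lengths(text, start):
            if length:
                spans.append((start, start + length))
    return spans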

Lorenzo Gatti

Fredrik Lundh

unread,
Sep 12, 2006, 6:23:14 AM9/12/06
to pytho...@python.org
Licheng Fang wrote:

> The idea is like this:
>
> (*INT) = \d+
> (*DECIMAL) = (*INT)\.(*INT)
> (*FACTION) = (*DECIMAL)/(*DECIMAL)
> (*NUMERALS) = (*FACTION)|(*DECIMAL)|(*INT)
> ... ...
>
> What's inside the sytactically wrong (* and ) is something to be
> replaced, and then I wrote a little script to do the string
> substitution

what's hiding under those ellipses? things like "TERM" and "FACTOR" and
"EXPRESSION", perhaps?

</F>

Frederic Rentsch

unread,
Sep 12, 2006, 12:27:41 PM9/12/06
to pytho...@python.org
Licheng,

If you need regexes, why not just reverse-sort your expressions? This
seems a lot easier and faster than writing another regex compiler.
Reverse-sorting places the longer ones ahead of the shorter ones.

>>> targets = ['be', 'bee', 'been', 'being']
>>> targets.sort ()
>>> targets.reverse ()
>>> regex = '|'.join (targets)
>>> re.findall (regex, 'Having been a bee in a former life, I don\'t
mind being what I am and wouldn\'t want to be a bee ever again.')
['been', 'bee', 'being', 'be', 'bee']

You might also take a look at a stream editor I recently came out with:
http://cheeseshop.python.org/pypi/SE/2.2%20beta

It has been well received, especially by newbies, I believe because it
is so simple to use and allows very compact coding.

>>> import SE
>>> Bee_Editor = SE.SE ('be=BE bee=BEE been=BEEN being=BEING')
>>> Bee_Editor ('Having been a bee in a former life, I don\'t mind
being what I am and wouldn\'t want to be a bee ever again.')

"Having BEEN a BEE in a former life, I don't mind BEING what I am and wouldn't want to BE a BEE ever again."

Because SE works by precedence on length, the targets can be defined in any order and modular theme sets can be spliced freely to form supersets.


>>> SE.SE ('<EAT> be==, bee==, been==, being==,')(above_string)
'been,bee,being,be,bee,'

You can do extraction filters, deletion filters, substitutions in any combination. It does multiple passes, can take files as input instead of strings, and can output files.

>>> Key_Word_Translator = SE.SE ('''
"*INT=int substitute"
"*DECIMAL=decimal substitute"
"*FACTION=faction substitute"
"*NUMERALS=numerals substitute"
# ... etc.
''')

I don't know if that could serve.

Regards

Frederic


Bryan Olson

unread,
Sep 12, 2006, 8:51:48 PM9/12/06
to
Tim Peters wrote:

> [Bryan Olson]
>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
>> book
>

> Strongly disagree: [...]

I know I'm disagreeing with a lot of smart people in panning
the book.

>> What Python uses is search-and-backtrack. Unfortunately such
>> engines don't have much theory behind them, and it's hard to
>> reason generally about what they do.
>
> Yup X 3, and the last is precisely why Friedl's book is valuable for
> people using "NFA" implementations: Friedl does a good job of
> explaining when and why you're likely to trigger atrocious runtime
> performance, and gives practical general techniques for avoiding those
> problems. None of that has anything to do with CompSci regexps
> either, but is vital knowledge for people hoping to make happy
> non-trivial use of Python/Perl/etc regexps.

I read the first edition, minus some engine-specific stuff.
General techniques are what I want, and I didn't see them in the
book. He had things to do and not do, but it didn't add up to
building (ir-)regular expressions with reliably tractable
behavior.


--
--Bryan

Bryan Olson

unread,
Sep 12, 2006, 9:29:02 PM9/12/06
to

Yes. Here's a way, but it sucks real bad:


import re

def longest_match(re_string, text):
    regexp = re.compile('(?:' + re_string + ')$')
    while text:
        m = regexp.match(text)
        if m:
            return m
        text = text[:-1]
    return None


--
--Bryan

Tim Peters

unread,
Sep 12, 2006, 10:16:43 PM9/12/06
to pytho...@python.org
[Licheng Fang]

>> Basically, the problem is this:
>>
>> >>> p = re.compile("do|dolittle")
>> >>> p.match("dolittle").group()
>> 'do'

...

>> The Python regular expression engine doesn't exhaust all the
>> possibilities, but in my application I hope to get the longest possible
>> match, starting from a given point.
>>
>> Is there a way to do this in Python?

[Bryan Olson]


> Yes. Here's a way, but it sucks real bad:
>
>
> def longest_match(re_string, text):
>     regexp = re.compile('(?:' + re_string + ')$')
>     while text:
>         m = regexp.match(text)
>         if m:
>             return m
>         text = text[:-1]
>     return None

If you want to try something like that, note that the match() method
accepts optional slice arguments, so the "text = text[:-1]" business
can be replaced with much quicker little-integer arithmetic.
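
For instance, a version along those lines (a sketch, not Bryan's or Tim's
actual code) can shrink the endpos argument instead of copying the text:

import re

def longest_match(re_string, text, start=0):
    regexp = re.compile('(?:' + re_string + ')$')
    # Try the longest candidate end position first and work down.
    for end in range(len(text), start - 1, -1):
        m = regexp.match(text, start, end)
        if m:
            return m
    return None

print(longest_match("do|dolittle", "dolittle").group())   # 'dolittle'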

Tim Peters

unread,
Sep 12, 2006, 11:09:00 PM9/12/06
to pytho...@python.org
[Bryan Olson]
>>> Unfortunately, the stuff about NFA's is wrong. Friedl's awful
>>> book

[Tim Peters]
>> Strongly disagree: [...]

[Bryan]


> I know I'm disagreeing with a lot of smart people in panning
> the book.

That's allowed :-)

>>> What Python uses is search-and-backtrack. Unfortunately such
>>> engines don't have much theory behind them, and it's hard to
>>> reason generally about what they do.

>> Yup X 3, and the last is precisely why Friedl's book is valuable for
>> people using "NFA" implementations: Friedl does a good job of
>> explaining when and why you're likely to trigger atrocious runtime
>> performance, and gives practical general techniques for avoiding those
>> problems. None of that has anything to do with CompSci regexps
>> either, but is vital knowledge for people hoping to make happy
>> non-trivial use of Python/Perl/etc regexps.

> I read the first edition, minus some engine-specific stuff.
> General techniques are what I want, and I didn't see them in the
> book. He had things to do and not do, but it didn't add up to
> building (ir-)regular expressions with reliably tractable
> behavior.

My guess then is that the book moved at too slow a pace for you, and
you got bored. The most valuable general technique he (eventually ;-)
explained he called "unrolling", and consists of writing a regexp in
the form:

normal* (?: special normal* )*

where the sets of characters with which `normal` and `special` can
start are disjoint. Under most "NFA" implementation, such a regexp is
immune to most bad behaviors, whether finding very long matches, or in
searching a very long string that happens not to contain any matches.

Without knowing this, under many implementations it's very easy to end
up writing a regexp that matches quickly when a short match exists,
but "blows up" (stack fault, or some checked internal limit hit) if
only a very long match exists; and/or takes even exponential time to
fail to match when a match doesn't exist.

For example, a direct application is writing a robust regular
expression to match C string literals (also one form of Python string
literal: double-quote character at each end, and backslash escapes,
including backslash-newline to span lines):

" [^"\\\n]* (?: \\[\x00-\xff] [^"\\\n]* )* "

The "normal case" is that it's just a regular old character: not a
double quote, backslash, or newline. The "special case" is a
backslash followed by any character whatsoever. The "normal" and
"special" cases obviously don't intersect, so the interior of this
regexp (i.e., ignoring the double quotes at each end) fits the
"unrolling" pattern to a tee.

As a result, you can use it with reasonably high confidence. For
/why/ that's so, read the rest of his book ;-) In effect, it gives
the search engine only one possible course of action at each character
in the target string. So long as that's "normal", it has to stay in
the "normal" part of the regexp, and as soon as it's "special" it has
to leave the "normal" part. The intent is to eliminate any
possibility for backtracking, thus ensuring predictable runtime and
ensuring that no form of internal backtracking stack needs to be used
(let alone hit its limit).

The unrolling idea was new to me, and as soon as I learned it I
replaced lots of regexps in the Python library with rewritten versions
that demonstrably performed very much better -- even allowed closing
some bug reports concerning extreme cases (extremely long matches, and
unbearable runtime on long strings without any matches).

A different lesson to take from this is that nobody is likely to
stumble into effective "NFA" techniques based on luck or "common
sense". It's a set of obscure & specialized learned skills.

Bryan Olson

unread,
Sep 13, 2006, 12:12:16 AM9/13/06
to
Tim Peters wrote:
> [...] The most valuable general technique he (eventually ;-)

> explained he called "unrolling", and consists of writing a regexp in
> the form:
>
> normal* (?: special normal* )*
>
> where the sets of characters with which `normal` and `special` can
> start are disjoint. Under most "NFA" implementation, such a regexp is
> immune to most bad behaviors, whether finding very long matches, or in
> searching a very long string that happens not to contain any matches.

[...]


> As a result, you can use it with reasonably high confidence. For
> /why/ that's so, read the rest of his book ;-) In effect, it gives
> the search engine only one possible course of action at each character
> in the target string. So long as that's "normal", it has to stay in
> the "normal" part of the regexp, and as soon as it's "special" it has
> to leave the "normal" part. The intent is to eliminate any
> possibility for backtracking, thus ensuring predictable runtime and
> ensuring that no form of internal backtracking stack needs to be used
> (let alone hit its limit).

So how general is that? The book just says "certain common
expressions". I think the real answer is that one can unroll
exactly the true regular expressions. The unrolling is
essentially resolving the states of a DFA by hand.


--
--Bryan

Bryan Olson

unread,
Sep 13, 2006, 12:26:01 AM9/13/06
to
Licheng Fang wrote:
> Basically, the problem is this:
>
>>>> p = re.compile("do|dolittle")
>>>> p.match("dolittle").group()
> 'do'
>
> Python's NFA regexp engine trys only the first option, and happily
> rests on that. There's another example:
>
>>>> p = re.compile("one(self)?(selfsufficient)?")
>>>> p.match("oneselfsufficient").group()
> 'oneself'
>
> The Python regular expression engine doesn't exaust all the
> possibilities, but in my application I hope to get the longest possible
> match, starting from a given point.
>
> Is there a way to do this in Python?


A reply from Tim Peters reminded me that I once programmed a
simple special case of this problem. In particular, given a
lot of strings, my function returns a regexp that efficiently
matches the longest prefix equal to any of the strings.
If the application is like the examples, where the target RE
is just the or of strings, the method should work well.


http://groups.google.com/group/comp.lang.python/msg/a22eb274845653e9


Now I'm thinking of a more general version, that handles all
true regular expressions.
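
For the or-of-strings case, a rough sketch of the idea (trie_regex is a
made-up name, not the code behind the link): merge the literal
alternatives into a prefix tree, so the engine's own greediness prefers
the longest alternative:

import re

def trie_regex(words):
    # Build a character trie of the words...
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[''] = {}                        # end-of-word marker
    # ...then emit a regexp in which every end-of-word makes the rest optional.
    def emit(node):
        branches = []
        optional = False
        for ch, child in sorted(node.items()):
            if ch == '':
                optional = True
            else:
                branches.append(re.escape(ch) + emit(child))
        if not branches:
            return ''
        body = '|'.join(branches)
        if optional:
            return '(?:%s)?' % body
        if len(branches) > 1:
            return '(?:%s)' % body
        return body
    return emit(trie)

pat = trie_regex(['do', 'dolittle'])         # roughly 'do(?:little)?'
print(re.match(pat, 'dolittle').group())     # 'dolittle'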


--
--Bryan

ga...@dsdata.it

unread,
Sep 13, 2006, 3:35:16 AM9/13/06
to

Frederic Rentsch wrote:


> If you need regexes, why not just reverse-sort your expressions? This
> seems a lot easier and faster than writing another regex compiler.
> Reverse-sorting places the longer ones ahead of the shorter ones.

Unfortunately, not all regular expressions have a fixed match length.
Which is the longest of, for example, /(abc)?def/ and /(def)?ghi/
depends on the input.

Lorenzo Gatti

Frederic Rentsch

unread,
Sep 13, 2006, 10:40:04 AM9/13/06
to pytho...@python.org
Very true! Funny you should remind me, considering that I spent quite
some time upgrading SE to allow regular expressions. Version 1 didn't
and could resolve precedence at compile time. Version 2 resolves
precedence at runtime by length of the matches and should function
correctly in this respect, although it might not function fast enough
for speed-critical applications. But then there is in general a
trade-off between convenience and speed.


Frederic

Frederic Rentsch

unread,
Sep 13, 2006, 10:45:35 AM9/13/06
to pytho...@python.org
Oh yes, and my proposal to reverse-sort the targets was in response to
the OP who wanted to generate a regex from a bunch of strings. It should
work for that.


Frederic

Licheng Fang

unread,
Sep 14, 2006, 11:29:19 PM9/14/06
to
Thank you guys. I've written a CYK parser and realized this is the
right direction. It gives every possible interpretation of the string
and I can retrieve whatever I want.
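
For the record, the core of a CYK recognizer is only a few lines. The toy
grammar below (numbers built from digits and a dot, in Chomsky normal form)
is a made-up example, not the OP's actual grammar:

def cyk(text, terminals, binaries, start):
    # table[i][j] holds every nonterminal that derives text[i:j], so it
    # also records every interpretation of every substring.
    n = len(text)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, ch in enumerate(text):
        table[i][i + 1] = set(terminals.get(ch, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for b in table[i][k]:
                    for c in table[k][j]:
                        table[i][j] |= binaries.get((b, c), set())
    return start in table[0][n]

TERMINALS = dict((d, set(['INT'])) for d in '0123456789')
TERMINALS['.'] = set(['DOT'])
BINARIES = {('INT', 'INT'): set(['INT']),        # INT    -> INT INT
            ('DOT', 'INT'): set(['DOTINT']),     # DOTINT -> DOT INT
            ('INT', 'DOTINT'): set(['DEC'])}     # DEC    -> INT DOTINT

print(cyk('3.14', TERMINALS, BINARIES, 'DEC'))   # True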

Tim Peters

unread,
Sep 16, 2006, 7:03:33 PM9/16/06
to pytho...@python.org
[Tim Peters]
>> [...] The most valuable general technique [Friedl] (eventually ;-)

>> explained he called "unrolling", and consists of writing a regexp in
>> the form:
>>
>> normal* (?: special normal* )*
>>
>> where the sets of characters with which `normal` and `special` can
>> start are disjoint. Under most "NFA" implementation, such a regexp is
>> immune to most bad behaviors, whether finding very long matches, or in
>> searching a very long string that happens not to contain any matches.

[Bryan Olson]


> So how general is that?

Between 6 and 7 ;-)

> The books just says "certain common expressions". I think the real
> answer is that one can unroll exactly the true regular expressions.

I'm sure he didn't have that much in mind; e.g., it's a real strain to
fit a fat alternation of fixed strings into that "pattern"
(cat|dog|wombat|weasel|...).

> The unrolling is essentially resolving the states of a DFA by hand.

Oh yes, for those who've studied the theoretical underpinnings. Most
people have not. Much of what he gives as practical advice amounts to
ways to "trick" an "NFA" engine into doing what a "DFA" engine would
do, but explained from the POV of how a backtracking search engine
works. And it makes /sense/ at that level too, divorced from any
knowledge of how a linear-time automaton might be constructed in
theory; e.g., you "want to" give the search engine only one choice
because you want to avoid the possibility of needing to backtrack
later, not because you're picturing a theoretical linear-time
automaton and striving to mimic its behavior.

That explanation may not work for you, but it's effective for people
who haven't studied the theory here. Such explanations also extend
naturally to gimmicks like backreferences, which have no counterpart
in CompSci regexps.
