I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher). Using the
regular expression as a grammar to generate all words in its language.
I was wondering if this possible in Python or possible using anything.
Google doesnt seem to give any obvious answers.
Many thanks in advance,
Nathan
I've done this some time ago.
This recipe
http://code.activestate.com/recipes/577041-merge-multiple-potentially-infinite-sorted-inputs-/
provides an infinite merge operation, required for effectively enumerating
a regular language (with shorter strings first, else 'a*b' would get stuck
repeating "a"s and never yielding any "b").
The example at the end shows the basics of how to use it (by hand) in a
very simple case.
To enumerate all strings matching '(a|bc)*' one should invoke
closure(product(['a'],['b','c'])), which gives:
'', 'a', 'aa', 'bc',
'aaa', 'abc', 'bca',
'aaaa', 'aabc', 'abca', 'bcaa', 'bcbc',
'aaaaa', 'aaabc', 'aabca', 'abcaa', 'abcbc', 'bcaaa', 'bcabc', 'bcbca',
...
Converting the former expression into the later function calls requires a
parser (not shown -- and I won't be able to find it until Friday)
--
Gabriel Genellina
Unless you limit your regexp to bounded regular expressions (you
don't use the "*", "+" and unbounded-top-end "{...}" tokens), it
has an infinite number of possible words that can match. As a
simple example, the regexp "a*" matches "" (the empty string),
"a", "aa", "aaa", "aaaa"...and so on up to the limit of a string
length.
There was a discussion on this a while back:
http://mail.python.org/pipermail/python-list/2010-February/1235120.html
so you might chat with the person who started that thread. You
basically have to write a regexp parser that generates code to
generate the expressions, and even simple expressions such as
"[a-z]{1,30}" can create an astronomical number of possible inputs.
-tkc
> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match
>
> I was wondering if this possible in Python or possible using
> anything. Google doesnt seem to give any obvious answers.
We did this one a couple weeks ago.
It's not possible in the general case (there are an infinite number of
matching words for many/most regular expressions).
--
Grant Edwards grant.b.edwards Yow! ... the MYSTERIANS are
at in here with my CORDUROY
gmail.com SOAP DISH!!
The pyparsing wiki Examples page includes this regex inverter:
http://pyparsing.wikispaces.com/file/view/invRegex.py
From the module header:
# Supports:
# - {n} and {m,n} repetition, but not unbounded + or * repetition
# - ? optional elements
# - [] character ranges
# - () grouping
# - | alternation
-- Paul
Thanks again,
Nathan
On 1 April 2010 10:17, Gabriel Genellina <gags...@yahoo.com.ar> wrote:
> En Wed, 31 Mar 2010 12:23:48 -0300, Paul McGuire <pt...@austin.rr.com>
> escribió:
> I took the liberty of modifying your invRegex.py example, adding support
> for infinite repeaters. It depends on two other modules:
>
> mergeinf.py (from http://code.activestate.com/recipes/577041) provides the
> infinite merge operation.
>
> enumre.py provides the basic functions (merge, prod, repeat, closure)
> necessary to enumerate the language generated by a given regular
> expression, even if it contains unbounded repeaters like *,+. The key is
> to generate shorter strings before longer ones, so in 'a*|b*' it doesn't
> get stuck generating infinite a's before any b.
>
> By example, "(a|bc)*d" corresponds to this code:
>
> prod(
> closure(
> merge(
> 'a',
> prod('b','c'))),
> 'd')
>
> which returns an infinite generator starting with:
>
> d
> ad
> aad
> bcd
> aaad
> abcd
> bcad
> aaaad
> aabcd
> abcad
> bcaad
> bcbcd
> aaaaad
> aaabcd
> aabcad
> ...
>
>
> I got the idea from
> http://userweb.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf
>
> Finally, invRegexInf.py is based on your original regex parser. I only
> modified the generation part, taking advantage of the above
> infrastructure; the parser itself remains almost the same. It essentially
> saves oneself the very tedious work of converting a regular expression
> into the equivalent sequence of function calls as shown above. (I hope I
> got it right: I like pyparsing a lot and use it whenever I feel it's
> appropriate, but not as often as to remember the details...)
>
> --
> Gabriel Genellina
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>