Generating text from a regular expression

Nathan Harmston

unread,

Mar 31, 2010, 6:49:14 AM3/31/10

to pytho...@python.org

Hi everyone,

I have a slightly complicated/medium sized regular expression and I
want to generate all possible words that it can match (to compare
performance of regex against an acora based matcher). Using the
regular expression as a grammar to generate all words in its language.
I was wondering if this possible in Python or possible using anything.
Google doesnt seem to give any obvious answers.

Many thanks in advance,

Nathan

Gabriel Genellina

unread,

Mar 31, 2010, 7:58:03 AM3/31/10

to pytho...@python.org

En Wed, 31 Mar 2010 07:49:14 -0300, Nathan Harmston
<iwanttob...@googlemail.com> escribió:

I've done this some time ago.
This recipe
http://code.activestate.com/recipes/577041-merge-multiple-potentially-infinite-sorted-inputs-/
provides an infinite merge operation, required for effectively enumerating
a regular language (with shorter strings first, else 'a*b' would get stuck
repeating "a"s and never yielding any "b").
The example at the end shows the basics of how to use it (by hand) in a
very simple case.
To enumerate all strings matching '(a|bc)*' one should invoke
closure(product(['a'],['b','c'])), which gives:

'', 'a', 'aa', 'bc',
'aaa', 'abc', 'bca',
'aaaa', 'aabc', 'abca', 'bcaa', 'bcbc',
'aaaaa', 'aaabc', 'aabca', 'abcaa', 'abcbc', 'bcaaa', 'bcabc', 'bcbca',
...

Converting the former expression into the later function calls requires a
parser (not shown -- and I won't be able to find it until Friday)

--
Gabriel Genellina

Tim Chase

unread,

Mar 31, 2010, 8:01:57 AM3/31/10

to Nathan Harmston, pytho...@python.org

Nathan Harmston wrote:
> I have a slightly complicated/medium sized regular expression
> and I want to generate all possible words that it can match
> (to compare performance of regex against an acora based
> matcher). Using the regular expression as a grammar to
> generate all words in its language. I was wondering if this
> possible in Python or possible using anything. Google doesnt
> seem to give any obvious answers.

Unless you limit your regexp to bounded regular expressions (you
don't use the "*", "+" and unbounded-top-end "{...}" tokens), it
has an infinite number of possible words that can match. As a
simple example, the regexp "a*" matches "" (the empty string),
"a", "aa", "aaa", "aaaa"...and so on up to the limit of a string
length.

There was a discussion on this a while back:

http://mail.python.org/pipermail/python-list/2010-February/1235120.html

so you might chat with the person who started that thread. You
basically have to write a regexp parser that generates code to
generate the expressions, and even simple expressions such as
"[a-z]{1,30}" can create an astronomical number of possible inputs.

-tkc

Grant Edwards

unread,

Mar 31, 2010, 10:13:32 AM3/31/10

to

On 2010-03-31, Nathan Harmston <iwanttob...@googlemail.com> wrote:

> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match
>

> I was wondering if this possible in Python or possible using
> anything. Google doesnt seem to give any obvious answers.

We did this one a couple weeks ago.

It's not possible in the general case (there are an infinite number of
matching words for many/most regular expressions).

--
Grant Edwards grant.b.edwards Yow! ... the MYSTERIANS are
at in here with my CORDUROY
gmail.com SOAP DISH!!

Paul McGuire

unread,

Mar 31, 2010, 11:23:48 AM3/31/10

to

On Mar 31, 5:49 am, Nathan Harmston <iwanttobeabad...@googlemail.com>
wrote:

> Hi everyone,
>
> I have a slightly complicated/medium sized regular expression and I
> want to generate all possible words that it can match (to compare
> performance of regex against an acora based matcher).

The pyparsing wiki Examples page includes this regex inverter:
http://pyparsing.wikispaces.com/file/view/invRegex.py

From the module header:
# Supports:
# - {n} and {m,n} repetition, but not unbounded + or * repetition
# - ? optional elements
# - [] character ranges
# - () grouping
# - | alternation

-- Paul

Nathan Harmston

unread,

Apr 2, 2010, 7:22:58 PM4/2/10

to Gabriel Genellina, pytho...@python.org

Thanks everyone, the invRegexInf is perfect.

Thanks again,

Nathan

On 1 April 2010 10:17, Gabriel Genellina <gags...@yahoo.com.ar> wrote:
> En Wed, 31 Mar 2010 12:23:48 -0300, Paul McGuire <pt...@austin.rr.com>
> escribió:

> I took the liberty of modifying your invRegex.py example, adding support
> for infinite repeaters. It depends on two other modules:
>
> mergeinf.py (from http://code.activestate.com/recipes/577041) provides the
> infinite merge operation.
>
> enumre.py provides the basic functions (merge, prod, repeat, closure)
> necessary to enumerate the language generated by a given regular
> expression, even if it contains unbounded repeaters like *,+. The key is
> to generate shorter strings before longer ones, so in 'a*|b*' it doesn't
> get stuck generating infinite a's before any b.
>
> By example, "(a|bc)*d" corresponds to this code:
>
> prod(
> closure(
> merge(
> 'a',
> prod('b','c'))),
> 'd')
>
> which returns an infinite generator starting with:
>
> d
> ad
> aad
> bcd
> aaad
> abcd
> bcad
> aaaad
> aabcd
> abcad
> bcaad
> bcbcd
> aaaaad
> aaabcd
> aabcad
> ...
>
>
> I got the idea from
> http://userweb.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf
>
> Finally, invRegexInf.py is based on your original regex parser. I only
> modified the generation part, taking advantage of the above
> infrastructure; the parser itself remains almost the same. It essentially
> saves oneself the very tedious work of converting a regular expression
> into the equivalent sequence of function calls as shown above. (I hope I
> got it right: I like pyparsing a lot and use it whenever I feel it's
> appropriate, but not as often as to remember the details...)
>
> --
> Gabriel Genellina
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>