Generating all permutations from a regexp

BJörn Lindqvist

unread,

Dec 22, 2006, 9:30:11 AM12/22/06

to pytho...@python.org

With regexps you can search for strings matching it. For example,
given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
do the reverse, from a regexp generate all strings that could match
it.

The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
"AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".

Is this possible to do? Obviously, for some regexps the set of matches
is unbounded (a list of everything that matches "*" would be very
unpractical), but how would you do it for simple regexps like the one
above?

--
mvh Björn

Fredrik Lundh

unread,

Dec 22, 2006, 11:04:04 AM12/22/06

to pytho...@python.org

BJörn Lindqvist wrote:

here's a start:

http://mail.python.org/pipermail/python-list/2001-August/102739.html

(the above generates *some* strings, not all, but the approach it uses
can be generalized).

</F>

Nick Craig-Wood

unread,

Dec 22, 2006, 1:30:05 PM12/22/06

to

A regular expression matcher uses a state machine to match strings.

What you want to do is do a traversal through that state machine
generating all possible matches at each point.

Luckily the python regexp matcher is written in python, so you have
access to the state machine directly if you want.

Take a look at sre*.py in the python library and you might be able to
work out what to do! I had a brief look myself, and it
looked... complicated!

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Fredrik Lundh

unread,

Dec 22, 2006, 1:46:45 PM12/22/06

to pytho...@python.org

Nick Craig-Wood wrote:

> A regular expression matcher uses a state machine to match strings.

unless it's the kind of regular expression matcher that doesn't use a
state machine, like the one in Python.

</F>

Paul McGuire

unread,

Dec 22, 2006, 10:39:55 PM12/22/06

to

On Dec 22, 8:30 am, "BJörn Lindqvist" <bjou...@gmail.com> wrote:
> With regexps you can search for strings matching it. For example,
> given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
> do the reverse, from a regexp generate all strings that could match
> it.
>
> The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
> "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".
>
> Is this possible to do?

Here is a first cut at your problem
(http://pyparsing-public.wikispaces.com/space/showimage/invRegex.py).
I used pyparsing to identify repeatable ranges within a regex, then
attached generator-generating classes to parse actions for each type of
regex element. Some limitations:
- unbounded '*' and '+' repetition is not allowed
- only supports \d, \w, and \s macros

Here are the test cases in the program that generate the expected list
of permutations:
[A-E]
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}

Nick Craig-Wood

unread,

Dec 23, 2006, 1:32:46 AM12/23/06

to

Ah! Well that is outside of my experience then... I've only really
studied the matchers produced by flex before. Apologies for the
mis-information!

Chris Johnson

unread,

Dec 23, 2006, 7:23:09 AM12/23/06

to

For a very small number of characters, it would be feasible. For any
finite number of characters, it would be possible (though it wouldn't
take much to take longer than the age of the universe). For reference,
in your simple example, you have 17,576,000 matching strings.

I'm curious as to why you would wish to do this. I certainly understand
considering hard problems for their own sake, but when I formulate
them, there's always some impetus that makes me say "Huh. Now I
wonder..."

Thomas Ploch

unread,

Dec 23, 2006, 5:42:33 PM12/23/06

to pytho...@python.org

Fredrik Lundh wrote:

> Nick Craig-Wood wrote:
>
>> A regular expression matcher uses a state machine to match strings.
>

> unless it's the kind of regular expression matcher that doesn't use a
> state machine, like the one in Python.
>

> </F>
>

How is the matching engine implemented then?

I thought regular languages can be described by deterministic /
non-deterministic finite state machines. I am just curious ...

Thomas

Chris Johnson

unread,

Dec 23, 2006, 5:58:57 PM12/23/06

to

Regular expressions are described by N?DFAs, but Perl-compatible
regular expressions (which pretty much every language implements now)
are not true regular expressions. They are actually Turing complete,
and thus have features that cannot be implemented in N?DFAs.

BJörn Lindqvist

unread,

Dec 25, 2006, 6:36:32 PM12/25/06

to pytho...@python.org

On 12/22/06, Fredrik Lundh <fre...@pythonware.com> wrote:
> BJörn Lindqvist wrote:
> > With regexps you can search for strings matching it. For example,
> > given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
> > do the reverse, from a regexp generate all strings that could match
> > it.
> >
> > The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
> > "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".
> >
> > Is this possible to do? Obviously, for some regexps the set of matches
> > is unbounded (a list of everything that matches "*" would be very
> > unpractical), but how would you do it for simple regexps like the one
> > above?
>

> here's a start:
>
> http://mail.python.org/pipermail/python-list/2001-August/102739.html

Thankyou! Who would have known that there is a sre_parse module...
makes things quite a bit easier.

--
mvh Björn

BJörn Lindqvist

unread,

Dec 25, 2006, 6:48:24 PM12/25/06

to pytho...@python.org

With some help from the sre_parse module (and the Python Cookbook),
here is the completed module:

# -*- coding: utf-8 -*-
import itertools
from sre_constants import *
import sre_parse
import string

category_chars = {
CATEGORY_DIGIT : string.digits,
CATEGORY_SPACE : string.whitespace,
CATEGORY_WORD : string.digits + string.letters + '_'
}

def unique_extend(res_list, list):
for item in list:
if item not in res_list:
res_list.append(item)

def handle_any(val):
"""
This is different from normal regexp matching. It only matches
printable ASCII characters.
"""
return string.printable

def handle_branch((tok, val)):
all_opts = []
for toks in val:
opts = permute_toks(toks)
unique_extend(all_opts, opts)
return all_opts

def handle_category(val):
return list(category_chars[val])

def handle_in(val):
out = []
for tok, val in val:
out += handle_tok(tok, val)
return out

def handle_literal(val):
return [chr(val)]

def handle_max_repeat((min, max, val)):
"""
Handle a repeat token such as {x,y} or ?.
"""
subtok, subval = val[0]

if max > 5000:
# max is the number of cartesian join operations needed to be
# carried out. More than 5000 consumes way to much memory.
raise ValueError("To many repetitions requested (%d)" % max)

optlist = handle_tok(subtok, subval)

iterlist = []
for x in range(min, max + 1):
joined = join([optlist] * x)
iterlist.append(joined)

return (''.join(it) for it in itertools.chain(*iterlist))

def handle_range(val):
lo, hi = val
return (chr(x) for x in range(lo, hi + 1))

def handle_subpattern(val):
return list(permute_toks(val[1]))

def handle_tok(tok, val):
"""
Returns a list of strings of possible permutations for this regexp
token.
"""
handlers = {
ANY : handle_any,
BRANCH : handle_branch,
CATEGORY : handle_category,
LITERAL : handle_literal,
IN : handle_in,
MAX_REPEAT : handle_max_repeat,
RANGE : handle_range,
SUBPATTERN : handle_subpattern}
try:
return handlers[tok](val)
except KeyError, e:
fmt = "Unsupported regular expression construct: %s"
raise ValueError(fmt % tok)

def permute_toks(toks):
"""
Returns a generator of strings of possible permutations for this
regexp token list.
"""
lists = [handle_tok(tok, val) for tok, val in toks]
return (''.join(it) for it in join(lists))

def join(iterlist):
"""
Cartesian join as an iterator of the supplied sequences. Borrowed
from:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/302478
"""
def rloop(seqin, comb):
if seqin:
for item in seqin[0]:
newcomb = comb + [item]
for item in rloop(seqin[1:], newcomb):
yield item
else:
yield comb
return rloop(iterlist, [])

########## PUBLIC API ####################

def ipermute(p):
toks = [tok_n_val for tok_n_val in sre_parse.parse(p)]
return permute_toks(toks)

def permute(p):
return list(ipermute(p))

Used like this:

from permute import ipermute

for s in ipermute('[A-Z]\d'):
print s

Almost all regular expression constructs are supported except for '*'
(which in the Python sre_parse implementation matches 0 to 65535
times), '+' and '^'. Non-ASCII characters doesn't work, but I think
that is a limitation in the sre_parse module. It works by first
parsing the regexp to string sequences so that [A-Z] becomes ['A',
'B', ... 'Z'], \d becomes ['1', ... , '9']. Then a Cartesian join is
applied on all string sequences to get all possible permutations of
them all.

Suggestions for improvements very much appreciated.

--
mvh Björn

Paul McGuire

unread,

Dec 25, 2006, 7:04:02 PM12/25/06

to

"Paul McGuire" <pt...@austin.rr.com> wrote in message
news:1166845194....@79g2000cws.googlegroups.com...

On Dec 22, 8:30 am, "BJörn Lindqvist" <bjou...@gmail.com> wrote:
> With regexps you can search for strings matching it. For example,
> given the regexp: "foobar\d\d\d". "foobar123" would match. I want to
> do the reverse, from a regexp generate all strings that could match
> it.
>
> The regexp: "[A-Z]{3}\d{3}" should generate the strings "AAA000",
> "AAA001", "AAA002" ... "AAB000", "AAB001" ... "ZZZ999".
>
> Is this possible to do?

Here is a first cut at your problem
(http://pyparsing-public.wikispaces.com/space/showimage/invRegex.py).
I used pyparsing to identify repeatable ranges within a regex, then
attached generator-generating classes to parse actions for each type of
regex element. Some limitations:
- unbounded '*' and '+' repetition is not allowed
- only supports \d, \w, and \s macros

=====================

Download the latest version of this file. It is now importable as its own
module, with the invert method that takes a regexp string and returns a
generator that yields all the possible matching strings. This file also
includes a simple count method, which returns the number of elements
returned by a generator (as opposed to calling len(list(invert("..."))),
which generates an intermediate list just to invoke len on it).

The reg exp features that have been added are:
- alternation using '|'
- min-max repetition using {min,max} format
- '.' wildcard character

Also fixed some repetition bugs, where "foobar{2}" was treated like
"(foobar){2}" - now both cases are handled correctly.

-- Paul

BJörn Lindqvist

unread,

Dec 25, 2006, 7:12:15 PM12/25/06

to pytho...@python.org

I have a thousand use cases in mind. For example:

1. Generate sentences for a bot:

ipermute("(I|You|He|She|It|They) do( not)? (dis)?like (you|him|her|me|they)"):

Generates many grammatically incorrect sentences but you get the point.

2. Generate URL:s to web resources:

The following should generate URL:s to all c.l.p digests from the mail archive:

def download_clp():
year_re = "(199\d|200[0-6])"
months = ["January",
"February",
"March",
"April",
"May",
"June",
"July",
"August",
"September",
"October",
"November",
"December"]
month_re = '(' + '|'.join(months) + ')'
fmt = "http://mail\.python\.org/pipermail/python-list/%s-%s\.txt"
url_re = fmt % (year_re, month_re)
for x in ipermute(url_re):
print "Downloading", x
code to download here

The same approach could be used to download all threads in a forum for
example, or all articles on Slashdot.

3. "Visualising" regular expressions. I think it would be easier to
write regular expressions if you could see what kind of data they
would match.

4. Port scanning:

ip_tuple = "(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
for x in ipermute("192\.168\." + "\.".join([ip_tuple] * 2)):
scan_ip(x)

--
mvh Björn