Re: emulating an and operator in regular expressions

Craig Ringer

Jan 3, 2005, 6:06:38 AM

to Ross La Haye, Comp. Lang. Python

On Mon, 2005-01-03 at 08:52, Ross La Haye wrote:
> How can an and operator be emulated in regular expressions in Python?
> Specifically, I want to return a match on a string if and only if 2 or more
> substrings all appear in the string. For example, for a string s = 'Jones
> John' and substrings sg0 = 'Jones' and sg1 = 'John', I want to return a
> match, but for substrings sg0 = 'Jones' and sg2 = 'Frank' I do not want to
> return a match. Because the expression 'A and B' is logically equivalent to
> 'not (not A or not B)' I thought I could try something along those lines,
> but can't crack it.

My first thought would be to express your 'A and B' regex as:

(A.*B)|(B.*A)

with whatever padding, etc, is necessary. You can even substitute in the
sub-regex for A and B to avoid writing them out twice.
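A minimal sketch of the pattern above, assuming A and B are literal strings. The helper name both_in_either_order and the use of re.escape are illustrative assumptions, not part of the original suggestion; drop the escaping if A and B are themselves sub-regexes.

```python
import re

def both_in_either_order(a, b):
    """Compile (A.*B)|(B.*A) for two literal strings a and b (hypothetical helper)."""
    # Assumption: a and b are plain text, so escape any regex metacharacters.
    a, b = re.escape(a), re.escape(b)
    # Substitute the two pieces into both alternatives so each is written once.
    return re.compile("(%s.*%s)|(%s.*%s)" % (a, b, b, a))

pattern = both_in_either_order("Jones", "John")
print(bool(pattern.search("Jones John")))   # True
print(bool(pattern.search("John Jones")))   # True
print(bool(pattern.search("Jones Frank")))  # False
```

Note that '.' does not match a newline unless re.DOTALL is passed, so this only finds both pieces when they sit on the same line.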

--
Craig Ringer

Terry Reedy

Jan 3, 2005, 2:56:34 PM

to pytho...@python.org

"Craig Ringer" <cr...@postnewspapers.com.au> wrote in message
news:1104750397.2...@rasputin.localnet...

> On Mon, 2005-01-03 at 08:52, Ross La Haye wrote:
>> How can an and operator be emulated in regular expressions in Python?

Regular expressions are designed to define and detect repetition and
alternatives. These are easily implemented with finite state machines.
REs are not meant for conjunction. 'And' can be done but, as I remember, only
messily and slowly. The demonstration I once read was definitely
theoretical, not practical.

Python was designed for 'and' logic (among everything else). If you want
practical code, use it.

if match1 and match2: do whatever.
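Spelled out, that might look like the sketch below: one independent search per pattern, combined with Python's own 'and' (here via all()). The function name contains_all is a made-up helper for illustration.

```python
import re

def contains_all(text, patterns):
    """Return True only if every regex in patterns matches somewhere in text."""
    # Each pattern is searched independently; all() is the 'and' of the results.
    return all(re.search(p, text) for p in patterns)

s = "Jones John"
print(contains_all(s, ["Jones", "John"]))   # True
print(contains_all(s, ["Jones", "Frank"]))  # False
```

This extends to any number of patterns without the pattern itself growing, which the either-order regex cannot do.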

Terry J. Reedy

John Machin

Jan 3, 2005, 4:46:02 PM

Provided you are careful to avoid overlapping matches, e.g. data = 'Fred
Johnson', query = ('John', 'Johnson').
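To illustrate the overlap with that data: independent substring tests both succeed because 'John' is found inside 'Johnson'. One way around it, sketched below, is to compare whole whitespace-separated tokens. The helper words_present is hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
def words_present(data, query):
    """Return True if every query word occurs as a whole token in data."""
    # Compare complete tokens, so 'John' cannot be satisfied by 'Johnson'.
    return set(query) <= set(data.split())

data = "Fred Johnson"
query = ("John", "Johnson")

# Plain substring tests: both hits land on the same word 'Johnson'.
print(all(q in data for q in query))  # True
# Whole-token comparison: 'John' is not a token of its own.
print(words_present(data, query))     # False
```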

Even this approach (A follows B or B follows A) gets tricky in the real
world of the OP, who appears to be attempting some sort of name
matching, where the word order may be scrambled. Problem is, punters
can have more than 2 words in their names, e.g. Mao Ze Dong[*], Louise
de la Valliere, and Johann Georg Friedrich von und zu Hohenlohe ... or
misreading handwriting can change the number of perceived words, e.g.
Walenkamp -> Wabu Kamp (no kidding).

[*] aka Mao Zedong aka Mao Tse Tung -- difficult enough before we start
considering variations in the order of the words.

Andrew Dalke

Jan 3, 2005, 5:09:17 PM

Craig Ringer wrote:
> My first thought would be to express your 'A and B' regex as:
>
> (A.*B)|(B.*A)
>
> with whatever padding, etc, is necessary. You can even substitute in the
> sub-regex for A and B to avoid writing them out twice.

That won't work because of overlaps. Consider

barkeep

with a search for A='bark' and B='keep'.

Neither A.*B nor B.*A will match because the 'k' needs to
be in both A and B.

The OP asked for words, that is, runs of consecutive letters separated
by non-letters or the end of the string. With that restriction, two
distinct words cannot overlap, so this solution will work.
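A short sketch of both points: the either-order pattern fails on 'barkeep', but behaves once each piece is restricted to a whole word. Using \b as the word delimiter is an approximation chosen for this example (it treats digits and underscores as word characters too), and the helper name word is hypothetical.

```python
import re

# Overlap case: the shared 'k' cannot be consumed by both pieces.
A, B = "bark", "keep"
either_order = re.compile("(%s.*%s)|(%s.*%s)" % (A, B, B, A))
print(either_order.search("barkeep"))  # None

def word(w):
    """Wrap a literal in word boundaries (approximation of 'whole word')."""
    return r"\b%s\b" % re.escape(w)

# Whole-word case: words cannot overlap, so either-order matching is safe.
a, b = word("Jones"), word("John")
words_either_order = re.compile("(%s.*%s)|(%s.*%s)" % (a, b, b, a))
print(bool(words_either_order.search("John Jones")))    # True
print(bool(words_either_order.search("Fred Johnson")))  # False
```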

Another possibility is to use positive assertions, as in
(?=A)(?=.*B)|(?=B)(?=.*A)
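As a sketch, here is that assertion-based pattern applied to the 'barkeep' case, with A and B assumed to be literal strings (the re.escape calls are an addition for safety). Lookaheads consume no text, so the same characters can satisfy both pieces.

```python
import re

A, B = re.escape("bark"), re.escape("keep")
# Each (?=...) only peeks ahead, so A and B are allowed to overlap.
pattern = re.compile("(?=%s)(?=.*%s)|(?=%s)(?=.*%s)" % (A, B, B, A))

print(bool(pattern.search("barkeep")))  # True
print(bool(pattern.search("barking")))  # False
```

The match itself is an empty string, so this form is only useful as a yes/no test, not for extracting the matched text.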

The best solution is to do a string.find and not worry about
implementing this as a regexp.
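A minimal sketch of that approach using the find method of strings, which returns -1 when the substring is absent. The function name contains_all_substrings is illustrative.

```python
def contains_all_substrings(s, substrings):
    """Return True if every substring occurs somewhere in s."""
    # str.find gives the index of the first occurrence, or -1 if there is none.
    return all(s.find(sub) != -1 for sub in substrings)

print(contains_all_substrings("Jones John", ["Jones", "John"]))   # True
print(contains_all_substrings("Jones John", ["Jones", "Frank"]))  # False
print(contains_all_substrings("barkeep", ["bark", "keep"]))       # True
```

Writing `sub in s` is equivalent to the find test and reads a little more directly.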

Andrew
da...@dalkescientific.com