"&"
A&B, where A and B can be arbitrary REs, creates a regular expression that will match both
A and B in any order. An arbitrary number of REs can be separated by the "&" in this way.
This can be used inside groups as well. REs separated by "&" are tried in arbitrary order,
the beginning of the first matching branch up to the end of the last branch that allows
the complete pattern (all branches) to match is considered the span of the match.
I envision it being used where a string is accepted at present if it matches multiple REs.
The REs could be rewritten as r"<RE#1>&<RE2>&<RE3>...".
The use of "&" could be supported by more efficient, (simultaneous?), matching of all
branches of the "&" construct - if such algorithms exist!
Just an idea,
Paddy.
Am I correct in thinking that
A&B
is equivalent in meaning to
(AB)|(BA)
except for the lack of grouping characters and repetition of the
subpatterns?
Jeff
I take it this means:
match(r1&r2,s) <==> match(r1,s) and match(r2,s)
For example:
r1 = aa? (a or aa)
r2 = aaa? (aa or aaa)
implies that only "aa" matches r1&r2
But what's the purpose?
I assume (without a formal prove at this point) that r1&r2 can
always be reformulated as a simpler expression, BIMBW.
--
René Pijlman
It can always be reformulated as an expression without an intersection
('&') operator. But not necessarily a simpler one.
BIMBW?
I've never understood either why the intersection operator is usually
missing from regular expression implementations. Tradition?
- Anders
> It can always be reformulated as an expression without an intersection
> ('&') operator. But not necessarily a simpler one.
>
> BIMBW?
>
> I've never understood either why the intersection operator is usually
> missing from regular expression implementations. Tradition?
Probably for the reason you mention: It can easily be reexpressed as
another regular expression, whether simpler or more complicated. That
isn't true in general, for, say, the | regular expression operator
--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ All the people in her neighborhood turn around and get mad and sing
\__/ Public Enemy
CSBuddy / http://www.alcyone.com/pyos/csbuddy/
A Counter-Strike server log file monitor in Python.
Could you give an example?
>BIMBW?
But I May Be Wrong :-)
>I've never understood either why the intersection operator is usually
>missing from regular expression implementations. Tradition?
The Dragon Book seems to take it for granted. It's not even an
exercise.
There's a whole body of theory built on the concept of regular
expressions, including formal language theory, finite state
automata and computational complexity. All of that may have to
be redesigned to accomodate the intersection in regular
expressions. I dunno, but if it aint broken... :-)
--
René Pijlman
But OTOH there are notational shortcuts in the usual RE
languages. For example:
a+ == aa*
r{3,5} == rrrr?r?
So what makes the intersection different?
--
René Pijlman
Anders> BIMBW?
Anders> I've never understood either why the intersection operator is usually
Anders> missing from regular expression implementations. Tradition?
Because in general, the intersection of two regular expressions is not
a regular expression, in the sense that it cannot generally be represented
by a regular grammar or parsed efficiently by a finite automaton.
--
Andrew Koenig, a...@research.att.com, http://www.research.att.com/info/ark
> But OTOH there are notational shortcuts in the usual RE
> languages. For example:
>
> a+ == aa*
> r{3,5} == rrrr?r?
>
> So what makes the intersection different?
Probably frequency of use and utility. I can't think of a single case
in my long history of usage of regular expressions (both simple and
extended) where an intersection would have been useful (and warranted as
opposed to just writing out what that was equivalent to without
intersections). (In fact, it wasn't entirely clear to me what the
original poster intended with his & operator, which is why I withheld
comment until others had made their guesses.)
This is usually done by running two regexp searches. It can be faked via
positive lookahead assertions, like
pat = re.compile(r'(?=.*x)(?=.*y)')
Then pat.match(string) succeeds if and only if string contains x and string
contains y (but both before the first newline character (if any), since .
doesn't match a newline by default).
> ..
> The use of "&" could be supported by more efficient, (simultaneous?),
> matching of all branches of the "&" construct - if such algorithms
> exist!
For a so-called DFA algorithm, yes, because the intersection of regular
languages is also regular, and membership in any regular language can be
recognized in one pass over the string in question. What to do in the
presence of (non-regular) backreferences is clear as mud, though, and
Python's regexp package isn't a DFA engine anyway. I don't know of any
regexp package that supports an intersection operator, so it would also
suffer from novelty. All in all, better to do the obvious thing (i.e., run
more than one regexp).
> Because in general, the intersection of two regular expressions is not
> a regular expression, in the sense that it cannot generally be represented
> by a regular grammar or parsed efficiently by a finite automaton.
I believe you're incorrect about this. Regular languages are closed
over intersection.
I guess you're right -- you can always make a DFA with states that
consist of the cartesian product of the states of the operand machines.
But implementing it efficiently may be another matter.
Give that a little more thought. If you have DFAs for regular languages L
and M, the intersection corresponds to running them both in lockstep, where
an accepting state of the combined DFA is hit whenever you land in
acceptings states for both of the original DFAs. This can be formalized
easily enough be constructing a new DFA whose states are the cross-product
of the original DFAs' states -- then the rest is obvious <wink>.
Frequency of use for a feature not implemented is usually 0 ;-)
- Anders
Have done it, it's easy at the DFA level.
- Anders
> Frequency of use for a feature not implemented is usually 0 ;-)
Frequency of use predicated on the supposition that it would exist,
obviously. That is, I don't hear a lot of people clamoring for such
functionality, so it's probably not really high on anyone's list. (This
is the first time I've ever heard of someone suggesting such a feature
in many years' use of regular expressions.)
--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ Custom reconciles us to everything.
\__/ Edmund Burke
Computer science / http://www.alcyone.com/max/reference/compsci/
A computer science reference.
1) & was hardly ever used.
2) Implementation problems. Suppose you have a&b&c&d&e&f&g&h. There are 40320 patterns that match this, but this is more than the number of backtracking states that the Python RE engine supports.
_____________________________________________________________________
This message has been checked for all known viruses by the MessageLabs Virus Scanning Service.
Taking statistical evidence from a hypothetical scenario, hm.
>That is, I don't hear a lot of people clamoring for such
> functionality, so it's probably not really high on anyone's list. (This
> is the first time I've ever heard of someone suggesting such a feature
> in many years' use of regular expressions.)
>
How can one miss what one doesn't know?
- Anders
The old IBM editors allowed to search for multiple words in any order, but
it was not based on regular expressions. If I remember right, it was used
mainly when you couldn't remember if a variable was 'string_length' or
'length_of_string', so it's more for interactive use than for writing
programs.
Daniel
Hmmm, use cases.
I'm doing some thought experiments around the problem of Property matching in Electronics.
You create properties describing your design, i.e. "if R=0 then R=1 in the next cycle
prove that S=1 within the next 3 cycles" for your design.
You can mathematically prove some assertions on designs but for complex assertions or
large designs, formal proof is too resource hungry so what you can do is simulate your
design, producing the state of signals R and S over many cycles, then check if the
property holds for this simulation, and how many times the property holds in this
simulation (COVERAGE RESULTS).
Their is likely to be many properties to check on the design
In my thought experiments I thought that properties could be translated to REs in some
way, and simulation results into strings.
When thinking of the need to match several properties, (OK, it could be done serially), I
came up with my original question!
Pad.
Hmmm, use cases.
Thanks all, for your input so far.
Having thought of it, and seen the posts on if it is necessary or not, or needed or not;
it seems a bit odd that we are left with one of the few places where the or 'operation' is
defined, but the orthogonal and 'operation' is not ?!
Paddy.
> Taking statistical evidence from a hypothetical scenario, hm.
It's a judgement call.
> How can one miss what one doesn't know?
Well, since I keep saying in the parts that you're clipping, I won't
bother repeating myself again.
--
Erik Max Francis / m...@alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ Strange is our situation here upon earth.
\__/ Albert Einstein
Esperanto reference / http://www.alcyone.com/max/lang/esperanto/
An Esperanto reference for English speakers.
I seem to detect two quite different usages of "intersection":
(1) The regular language (RL) characterised by the RE "abc" produces
just one string i.e. 'abc'. Ditto the RL described by the RE "xyz"
produces just one string 'xyz'. I (a mug punter, not an expert) would
understand the intersection of the two RLs to be empty i.e. [] i.e.
the two RLs produce no common strings. Same with "a+" and "b+",
whereas "a*" and "b*" have a non-empty intersection [''].
(2) The effect that the OP (together with Tim and other posters)
seemed to expect from the & operator was that (abc)&(xyz) would match
(abcxyz)|(xyzabc).
<:-)>
And now for a pragmatic point: if the & operator was ever so slightly
useful, or ever so slightly demanded, or ever so slightly easy to
implement, it would have surfaced in a certain language long before
now.
</:-)>
> I seem to detect two quite different usages of "intersection":
> (1) The regular language (RL) characterised by the RE "abc" produces
> just one string i.e. 'abc'. Ditto the RL described by the RE "xyz"
> produces just one string 'xyz'. I (a mug punter, not an expert) would
> understand the intersection of the two RLs to be empty i.e. [] i.e.
> the two RLs produce no common strings. Same with "a+" and "b+",
> whereas "a*" and "b*" have a non-empty intersection [''].
> (2) The effect that the OP (together with Tim and other posters)
> seemed to expect from the & operator was that (abc)&(xyz) would match
> (abcxyz)|(xyzabc).
I think the difference is that in Python, the RE "abc" matches any
string containing 'abc', not just the exact string itself. I'd guess
most formalists who aren't familiar with Python would write that
regular expression as ".*abc.*", and expect they's need to write
"(.*abc.*)&(.*xyz.*)" to match both 'abcxyz' and 'xyzabc'.
A certain molluskian by-product language seems to have Big Things in mind
for REs in it's next incarnation, should it happen. The write-up is detailed
and interesting.
Dave LeBlanc
Seattle, WA USA
Most people think of the Python regexp "abc" matching any string containing
"abc", though. You can't get very far discussing regexps in terms of
regular expressions <0.9 wink>.
> I (a mug punter, not an expert) would understand the intersection of the
> two RLs to be empty i.e. [] i.e. the two RLs produce no common strings.
Yes.
> Same with "a+" and "b+", whereas "a*" and "b*" have a non-empty
> intersection [''].
Both so.
> (2) The effect that the OP (together with Tim and other posters)
> seemed to expect from the & operator was that (abc)&(xyz) would match
> (abcxyz)|(xyzabc).
I expected that the OP expected it would be equivalent to the RE
^[\x00-\xff]*(abc[\x00-\xff]*xyz|xyz[\x00-\xff]*abc)[\x00-\xff]*\Z
in terms of whether or not it matched.
> <:-)>
> And now for a pragmatic point: if the & operator was ever so slightly
> useful, or ever so slightly demanded, or ever so slightly easy to
> implement, it would have surfaced in a certain language long before
> now.
> </:-)>
RL intersection *is* routinely implemented in finite-state machine packages.
For example, see
http://www.research.att.com/~mohri/fsm/
and that package's fsmintersect program. I can think of very few plausible
use cases for it in string-matching, though ("it's a Sunday and a Thursday"
or "it's an identifier and it's a number" make less sense than "it's a floor
wax and a dessert topping"), and in this respect it's relevant that common
notations for grammar specification (like EBNF) don't bother with an
intersection operator either.
I am using "match" in the ordinary sense which coincides with the
Python definition of re.match(), whereas you seem to be using "match"
to mean what re.search() does. If I had have meant the RE ".*abc.*", I
would have said that.
But do they think indentation for block structure is useful? Demanded?
>
>
> A certain molluskian by-product language seems to have Big Things in mind
> for REs in it's next incarnation, should it happen. The write-up is detailed
> and interesting.
>
> Dave LeBlanc
> Seattle, WA USA
>
>
Yeh,
I've bin there, read about it. from that I got the point that something that has been
stagnant for many years - a languages regexp package - is now being re-thought in quite
revolutionary ways.
If they can do it, it makes me think that the re packages contents are up for review!
(That document on perl 6 REs I guess were both on about).
Pad.
Nope, formalists like me (well in this case anyway) would say that
regular expression "abc" doesn't actively match anything by itself.
It just defines a set of strings. 'abcxyz' and 'xyzabc' are not
members of the set defined by the RE "abc". To use it for something,
you need to use an active component such as re.match or re.search,
which do not check if the entire string and the RE match, but look for
matching _sub_strings.
I used the word "intersection" exactly to clarify that, regardless of
what the OP meant for an "&" operator to do, what I was talking about
was the "formalist" operation of intersection, which really is just
set intersection. Please reserve "intersection" for that use, or the
confusion will be total.
- Anders
I'm sorry, I don't follow you. If you have written something earlier
that addresses "How can one miss what one doesn't know?" then I am
unable to find it.
- Anders
This is not in the usual syntax, but I hope you can make something of
it anyway. (Ever since I wrote a utility with cleaned-up regexp
syntax, traditional syntax is just too painful for anything
non-trivial.)
Here's a regular expression that matches a C comment unambiguously.
:ccomment = "/*" ([^"*"] | "*"+ [^"*/"])* "*"+ "/"
Say I have another regexp that matches certain sorts of special syntax
that might occur in comments:
:encoding = "-*-" [" "A-Za-z]* "coding:" [A-Za-z0-9"- "]+ "-*-"
Now combine these two to form a regular expression to match comments
that contain encoding instructions. With intersection, it's fairly
simple:
:containsencoding = .* :encoding .*
:commentwithencoding = :ccomment & :containsencoding
I haven't tried to express this without intersection, and I wouldn't
care to try either.
the-real-fun-begins-when-you-have-a-complement-operator-as-well-ly
y'rs, Anders
It's useful in combination with complement ("it's a .*day but not a
Sunday").
But then complement is probably too spooky for most people.
- Anders
If that's the use case, it's more an argument for adding a difference
operator (like R-S = the strings matched by R less the strings matched by
S).
> But then complement is probably too spooky for most people.
Python has negative lookahead assertions now -- but I'm not sure I've ever
seen them used <wink>.
That might be so. Not that I'm proposing to add anything of either
kind.
It's easier though to express difference through intersection and
complement
R-S = R & ~S
than the other way around
R & S = (R|S) - (R-S) - (S-R)
- Anders