> Why are you too lazy to do any research before posting a question?
Errr... what?
I'm only somewhat familiar with the extra stuff that languages provide in their regexes beyond true regular expressions and simple extensions, but I was surprised to see the question because I too would have expected that to work. (And match any sequence of whitespace characters whose length is a multiple of six.) I reskimmed the documentation of the re module and didn't see anything that would prohibit it. I looked at several of the results of a Google search for the multiple repeat error, and didn't really find any explanation beyond "because you can't do it" or "here's a regex that works." (Well, OK, I did see a mention of + being a possessive quantifier which Python doesn't support. But that still doesn't explain why my expectation isn't what happened.)
-----Original Message-----
From: Python-list [mailto:python-list-bounces+saroo_jain=infosys....@python.org] On Behalf Of Mark Lawrence
Sent: Friday, October 05, 2012 3:29 AM
To: python-l...@python.org
Subject: Re: + in regular expression
On 04/10/2012 04:01, contro opinion wrote:
>>>> str=" gg"
>>>> x1=re.match("\s+",str)
>>>> x1
> <_sre.SRE_Match object at 0xb7354db0>
>>>> x2=re.match("\s{6}",str)
>>>> x2
> <_sre.SRE_Match object at 0xb7337f38>
>>>> x3=re.match("\s{6}+",str)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/usr/lib/python2.6/re.py", line 137, in match
> return _compile(pattern, flags).match(string)
> File "/usr/lib/python2.6/re.py", line 245, in _compile
> raise error, v # invalid expression
> sre_constants.error: multiple repeat
> why the "\s{6}+" is not a regular pattern?
Why are you too lazy to do any research before posting a question?
On 03Oct2012 21:17, Ian Kelly <ian.g.ke...@gmail.com> wrote:
| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion <contropin...@gmail.com> wrote:
| > why the "\s{6}+" is not a regular pattern?
|
| Use a group: "(?:\s{6})+"
Yeah, it is probably a precedence issue in the grammar.
"(\s{6})+" is also accepted.
-- Cameron Simpson <c...@zip.com.au>
Disclaimer: ERIM wanted to share my opinions, but I wouldn't let them.
- David Wiseman <dwise...@erim.org>
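The grouping workaround from the messages above can be sketched as follows, assuming nothing beyond the standard re module. The helper name compiles is made up for illustration. The bare "\s{6}+" raises "multiple repeat" on the Python 2.6 shown in the traceback, whereas Python 3.11 and later accept it as a possessive quantifier, so the sketch reports that case rather than assuming an outcome.

```python
import re

def compiles(pattern):
    """Return True if re.compile accepts `pattern` (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

text = " " * 12 + "gg"

# Grouping turns \s{6} into a single element, so a second quantifier is legal.
grouped = re.match(r"(?:\s{6})+", text)
print(len(grouped.group()))       # 12: two repetitions of six whitespace characters

# A capturing group works as well; group 1 holds the last repetition.
captured = re.match(r"(\s{6})+", text)
print(len(captured.group(1)))     # 6

# The bare form is version dependent, so report it instead of assuming.
print(compiles(r"\s{6}+"))
```

If the capture is not needed, the non-capturing form is probably the better default, since it does not shift the numbering of other groups.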
Cameron Simpson <c...@zip.com.au> wrote:
> On 03Oct2012 21:17, Ian Kelly <ian.g.ke...@gmail.com> wrote:
>| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion <contropin...@gmail.com> wrote:
>| > why the "\s{6}+" is not a regular pattern?
>|
>| Use a group: "(?:\s{6})+"
> Yeah, it is probably a precedence issue in the grammar.
> "(\s{6})+" is also accepted.
It's about syntax, not precedence, but the documentation doesn't really spell it out in full. Like most regex documentation it talks in woolly terms about special characters rather than giving a formal syntax.
A regular expression element may be followed by a quantifier. Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers '*?', '+?', '{n,m}?'). There's nothing in the regex language which says you can follow an element with two quantifiers. Parentheses (grouping or non-grouping) around a regex turn that regex into a single element which is why you can then use another quantifier.
In BNF, I think Python's regexes would be something like:
re ::= union | simple-re
union ::= re "|" simple-re
simple-re ::= concatenation | basic-re
concatenation ::= simple-re basic-re
basic-re ::= element | element quantifier
element ::= group | nc-group | "." | "^" | "$" | char | charset
quantifier ::= "*" | "+" | "?" | "{" NUMBER "}" | "{" NUMBER "," NUMBER "}" | "*?" | "+?" | "{" NUMBER "," NUMBER "}?"
group ::= "(" re ")"
nc-group ::= "(?:" re ")"
char ::= <any non-special character> | "\" <any character>
... and so on. I didn't include charsets or all the (?...) extensions or special sequences.
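To make the "at most one quantifier per element" rule concrete, here is a rough recognizer that follows the shape of the grammar above, with the left recursion replaced by loops. It is only a sketch: the names accepts, QUANTIFIER and SPECIAL are invented for illustration, charsets and the (?...) extensions other than (?:...) are left out, and it is not how the re module itself parses patterns.

```python
import re

# One quantifier, optionally followed by "?" to make it lazy.
QUANTIFIER = re.compile(r"(?:[*+?]|\{\d+(?:,\d+)?\})\??")
SPECIAL = set("()|*+?{}")

def accepts(pattern):
    """Return True if `pattern` fits the simplified grammar (hypothetical helper)."""
    pos = 0

    def parse_re():
        # re ::= simple-re ("|" simple-re)*
        nonlocal pos
        parse_simple()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            parse_simple()

    def parse_simple():
        # simple-re ::= basic-re basic-re ...   (stops at "|" or ")")
        parse_basic()
        while pos < len(pattern) and pattern[pos] not in "|)":
            parse_basic()

    def parse_basic():
        # basic-re ::= element | element quantifier   (at most ONE quantifier)
        nonlocal pos
        parse_element()
        m = QUANTIFIER.match(pattern, pos)
        if m:
            pos = m.end()

    def parse_element():
        nonlocal pos
        if pos >= len(pattern):
            raise ValueError("element expected")
        ch = pattern[pos]
        if ch == "(":
            # Both "(" re ")" and "(?:" re ")" yield a single element.
            pos += 3 if pattern.startswith("(?:", pos) else 1
            parse_re()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing )")
            pos += 1
        elif ch == "\\":
            if pos + 1 >= len(pattern):
                raise ValueError("dangling backslash")
            pos += 2
        elif ch in SPECIAL:
            # A quantifier here means it was stacked on a previous quantifier.
            raise ValueError("unexpected %r" % ch)
        else:
            pos += 1

    try:
        parse_re()
    except ValueError:
        return False
    return pos == len(pattern)

for candidate in (r"\s{6}", r"\s{6}+", r"(?:\s{6})+", r"(\s{6})+"):
    print(candidate, accepts(candidate))
```

Under this grammar "\s{6}+" is rejected simply because nothing can consume the "+" once the element already has its quantifier, whereas the parenthesised forms start a fresh element.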
> A regular expression element may be followed by a quantifier.
> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
> you can follow an element with two quantifiers.
In fact, *you* did -- the first sentence of that paragraph! :-)
\s is a regex, so you can follow it with a quantifier and get \s{6}. That's also a regex, so you should be able to follow it with a quantifier.
I can understand that you can create a grammar that excludes it. I'm actually really interested to know if anyone knows whether this was a deliberate decision and, if so, what the reason is. (And if not -- should it be considered a (low priority) bug?)
Was it because such patterns often reveal a mistake? Because "\s{6}+" has other meanings in different regex syntaxes and the designers didn't want confusion? Because it was simpler to parse that way? Because the "hey you recognize regular expressions by converting it to a finite automaton" story is a lie in most real-world regex implementations (in part because they're not actually regular expressions) and repeated quantifiers cause problems with the parsing techniques that actually get used?
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a
> quantifier.
OK, I guess this isn't true... you said a "regular expression *element*" can be followed by a quantifier. I just took what I usually see as part of a regular expression and read into your post something it didn't quite say. Still, the rest of mine applies.
> On 10/05/2012 04:23 AM, Duncan Booth wrote:
>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.
> In fact, *you* did -- the first sentence of that paragraph! :-)
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a quantifier.
> I can understand that you can create a grammar that excludes it. I'm
> actually really interested to know if anyone knows whether this was a
> deliberate decision and, if so, what the reason is. (And if not --
> should it be considered a (low priority) bug?)
> Was it because such patterns often reveal a mistake? Because "\s{6}+"
> has other meanings in different regex syntaxes and the designers didn't
> want confusion? Because it was simpler to parse that way? Because the
> "hey you recognize regular expressions by converting it to a finite
> automaton" story is a lie in most real-world regex implementations (in
> part because they're not actually regular expressions) and repeated
> quantifiers cause problems with the parsing techniques that actually get
> used?
You rarely want to repeat a repeated element. It can also result in catastrophic
backtracking unless you're _very_ careful.
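As a rough illustration of the backtracking concern, the sketch below times a nested repeat against input that cannot match. The function name time_failed_match and the chosen sizes are arbitrary; the timings depend on the machine and on the regex engine version, so no specific numbers are claimed, only that the cost tends to climb quickly as the input grows.

```python
import re
import time

def time_failed_match(n):
    """Time a match attempt on n 'a's followed by a character that is not 'b'."""
    subject = "a" * n + "!"
    start = time.perf_counter()
    # A repeated element inside another repeat: the engine may try many
    # different ways of splitting the run of 'a's before giving up.
    result = re.match(r"(?:a+)+b", subject)
    return result, time.perf_counter() - start

for n in (10, 14, 18):
    result, elapsed = time_failed_match(n)
    print(n, result, "%.4f s" % elapsed)
```

The sizes are kept small on purpose; much larger values of n can make the loop take a very long time.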
In many other regex implementations (including mine), "*+", "++" and
"?+" are possessive quantifiers, much as "*?", "+?" and "??" are lazy
quantifiers.
You could, of course, ask why adding "?" after a quantifier doesn't
make it optional, e.g. why r"\s{6}?" doesn't mean the same as
r"(?:\s{6})?", or why r"\s{0,6}?" doesn't mean the same as
r"(?:\s{0,6})?".
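A small sketch of that difference, using only the standard re module; the variable names are arbitrary. A "?" directly after a quantifier makes it lazy, while a "?" after a group makes the whole group optional.

```python
import re

six = " " * 6 + "x"

# "{0,6}?" is lazy: it matches as few characters as it can get away with.
lazy = re.match(r"\s{0,6}?", six)
print(repr(lazy.group()))          # '' (zero repetitions already satisfy it)

# Grouping first makes the trailing "?" an ordinary greedy "optional".
optional = re.match(r"(?:\s{0,6})?", six)
print(len(optional.group()))       # 6

# With a fixed count, laziness changes nothing, but the element is still required.
print(re.match(r"\s{6}?", "x"))    # None: six whitespace characters are mandatory
print(repr(re.match(r"(?:\s{6})?", "x").group()))   # '' (the group is optional)
```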
On 05Oct2012 10:27, Evan Driscoll <drisc...@cs.wisc.edu> wrote:
| I can understand that you can create a grammar that excludes it. [...]
| Was it because such patterns often reveal a mistake?
For myself, I would consider that sufficient reason.
I've seen plenty of languages (C and shell, for example, though they
are not alone or egregious) where a compiler can emit a syntax complaint
many lines from the actual coding mistake (in shell, an unclosed quote
or control construct is a common example; Python has the same issue
but mitigated by the indentation requirements which cut the occurrence
down a lot).
Forbidding a common error by requiring a wordier workaround isn't
unreasonable.
| Because "\s{6}+"
| has other meanings in different regex syntaxes and the designers didn't
| want confusion?
I think Python REs are supposed to be Perl compatible; ISTR an opening
sentence to that effect...
| Because it was simpler to parse that way? Because the
| "hey you recognize regular expressions by converting it to a finite
| automaton" story is a lie in most real-world regex implementations (in
| part because they're not actually regular expressions) and repeated
| quantifiers cause problems with the parsing techniques that actually get
| used?
There are certainly constructs that can cause an exponential amount
of backtracking if misused. One could make a case for discouragement
(though not a case for forbidding them).
Cameron Simpson <c...@zip.com.au> wrote:
>| Because "\s{6}+"
>| has other meanings in different regex syntaxes and the designers didn't
>| want confusion?
> I think Python REs are supposed to be Perl compatible; ISTR an opening
> sentence to that effect...
I don't know the full history of how regex engines evolved, but I suspect at least part of the answer is that the decisions the Perl developers made influenced the other implementations.
Perl's quantifiers allow both '?' and '+' as modifiers on the standard quantifiers so clearly you cannot stack those particular quantifiers in Perl, therefore quantifiers in general are unstackable.
The only grammars I can find online for regular expressions split out the elements and quantifiers the way I did in my previous post. Python's regex parser (and I would guess also most of the others in existence) tends more toward spaghetti code than toward following a grammar (_parse is a 238-line function). So I think it really is just trying to match existing regular expression parsers, and any possible grammar is an excuse for why it should be the way it is rather than an explanation.