Re: + in regular expression

Ian Kelly

Oct 3, 2012, 11:17:19 PM

to Python

On Wed, Oct 3, 2012 at 9:01 PM, contro opinion <contro...@gmail.com> wrote:
> why the "\s{6}+" is not a regular pattern?

Use a group: "(?:\s{6})+"
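
A minimal sketch of that suggestion, assuming the goal is to match leading whitespace in whole blocks of six. The helper name `leading_block_length` is made up for illustration and is not from the original post.

```python
import re

# Non-capturing group around \s{6}, so the + repeats the whole six-character block.
BLOCKS_OF_SIX = re.compile(r"(?:\s{6})+")


def leading_block_length(text):
    """Return how many leading whitespace characters match in blocks of six."""
    match = BLOCKS_OF_SIX.match(text)
    return len(match.group(0)) if match else 0


print(leading_block_length(" " * 6 + "gg"))   # 6
print(leading_block_length(" " * 13 + "gg"))  # 12 (two full blocks, one space left over)
print(leading_block_length(" " * 5 + "gg"))   # 0 (not even one full block)
```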

Mark Lawrence

Oct 4, 2012, 5:59:27 PM

to pytho...@python.org

On 04/10/2012 04:01, contro opinion wrote:
>>>> str=" gg"
>>>> x1=re.match("\s+",str)
>>>> x1
> <_sre.SRE_Match object at 0xb7354db0>
>>>> x2=re.match("\s{6}",str)
>>>> x2
> <_sre.SRE_Match object at 0xb7337f38>
>>>> x3=re.match("\s{6}+",str)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/usr/lib/python2.6/re.py", line 137, in match
> return _compile(pattern, flags).match(string)
> File "/usr/lib/python2.6/re.py", line 245, in _compile
> raise error, v # invalid expression
> sre_constants.error: multiple repeat

>>>>
>
> why the "\s{6}+" is not a regular pattern?
>
>
>

Why are you too lazy to do any research before posting a question?

--
Cheers.

Mark Lawrence.

Evan Driscoll

Oct 4, 2012, 10:25:40 PM

to pytho...@python.org

On 10/04/2012 04:59 PM, Mark Lawrence wrote:
>> why the "\s{6}+" is not a regular pattern?
>>
>>
>>
>
>
> Why are you too lazy to do any research before posting a question?
>

Errr... what?

I'm only somewhat familiar with the extra stuff that languages provide
in their regexes beyond true regular expressions and simple extensions,
but I was surprised to see the question because I too would have
expected that to work. (And match any sequence of whitespace characters
whose length is a multiple of six.) I reskimmed the documentation of the
re module and didn't see anything that would prohibit it. I looked at
several of the results of a Google search for the multiple repeat error,
and didn't really find any explanation beyond "because you can't do it"
or "here's a regex that works." (Well, OK, I did see a mention of +
being a possessive quantifier which Python doesn't support. But that
still doesn't explain why my expectation isn't what happened.)

In what way is that an unreasonable question?

Evan

Saroo Jain

Oct 4, 2012, 11:44:25 PM

to Mark Lawrence, pytho...@python.org

x3=re.match("\s{6}+",str)

instead use
x3=re.match("\s{6,}",str)

This serves the purpose. It also gives some food for thought about why the first one throws an error.

Cheers,
Saroo

Ian Kelly

Oct 5, 2012, 1:14:13 AM

to Python

On Thu, Oct 4, 2012 at 9:44 PM, Saroo Jain <Saroo...@infosys.com> wrote:
> x3=re.match("\s{6}+",str)
>
> instead use
> x3=re.match("\s{6,}",str)
>
> This serves the purpose. It also gives some food for thought about why the first one throws an error.

That matches six or more spaces, not multiples of six spaces.
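
To make the difference concrete, here is a small sketch comparing the two patterns on whole strings of spaces. The names `AT_LEAST_SIX`, `MULTIPLE_OF_SIX` and `whole_string_matches` are illustrative only.

```python
import re

AT_LEAST_SIX = re.compile(r"\s{6,}")         # six or more whitespace characters
MULTIPLE_OF_SIX = re.compile(r"(?:\s{6})+")  # 6, 12, 18, ... whitespace characters


def whole_string_matches(pattern, text):
    """True if the pattern matches the entire string, not just a prefix."""
    return pattern.fullmatch(text) is not None


for n in (6, 7, 12):
    spaces = " " * n
    print(n,
          whole_string_matches(AT_LEAST_SIX, spaces),
          whole_string_matches(MULTIPLE_OF_SIX, spaces))
# 6 True True
# 7 True False
# 12 True True
```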

Cameron Simpson

Oct 5, 2012, 1:22:28 AM

to Ian Kelly, Python

On 03Oct2012 21:17, Ian Kelly <ian.g...@gmail.com> wrote:
| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion <contro...@gmail.com> wrote:

| > why the "\s{6}+" is not a regular pattern?
|

| Use a group: "(?:\s{6})+"

Yeah, it is probably a precedence issue in the grammar.
"(\s{6})+" is also accepted.
--
Cameron Simpson <c...@zip.com.au>

Disclaimer: ERIM wanted to share my opinions, but I wouldn't let them.
- David Wiseman <dwis...@erim.org>

Duncan Booth

Oct 5, 2012, 5:23:26 AM

to

Cameron Simpson <c...@zip.com.au> wrote:

> On 03Oct2012 21:17, Ian Kelly <ian.g...@gmail.com> wrote:
>| On Wed, Oct 3, 2012 at 9:01 PM, contro opinion
>| <contro...@gmail.com> wrote:
>| > why the "\s{6}+" is not a regular pattern?
>|
>| Use a group: "(?:\s{6})+"
>
> Yeah, it is probably a precedence issue in the grammar.
> "(\s{6})+" is also accepted.

It's about syntax, not precedence, but the documentation doesn't really
spell it out in full. Like most regex documentation it talks in woolly
terms about special characters rather than giving a formal syntax.

A regular expression element may be followed by a quantifier.
Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
'*?', '+?', '{n,m}?'). There's nothing in the regex language which says
you can follow an element with two quantifiers. Parentheses (grouping or
non-grouping) around a regex turn that regex into a single element which
is why you can then use another quantifier.
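
As a quick illustration of the two kinds of parentheses, the sketch below assumes a string with twelve leading spaces. Both forms accept the second quantifier; the only visible difference is that the capturing group keeps the text of its last repetition.

```python
import re

twelve = " " * 12 + "gg"

capturing = re.match(r"(\s{6})+", twelve)
non_capturing = re.match(r"(?:\s{6})+", twelve)

# Both match the same twelve characters overall.
print(len(capturing.group(0)), len(non_capturing.group(0)))  # 12 12

# A repeated capturing group remembers only its last repetition.
print(capturing.groups())      # ('      ',)
print(non_capturing.groups())  # ()
```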

In BNF, I think Python's regexes would be something like:

re ::= union | simple-re
union ::= re "|" simple-re
simple-re ::= concatenation | basic-re
concatenation ::= simple-re basic-re
basic-re ::= element | element quantifier
element ::= group | nc-group | "." | "^" | "$" | char | charset
quantifier ::= "*" | "+" | "?" | "{" NUMBER "}" | "{" NUMBER "," NUMBER "}"
             | "*?" | "+?" | "{" NUMBER "," NUMBER "}?"
group ::= "(" re ")"
nc-group ::= "(?:" re ")"
char ::= <any non-special character> | "\" <any character>

... and so on. I didn't include charsets or all the (?...) extensions or
special sequences.
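
To show how a grammar of this shape rules out stacked quantifiers, here is a toy recursive-descent recognizer. It is only a sketch under stated assumptions: it is not how the `re` module parses patterns, it ignores charsets and every `(?...)` extension except `(?:...)`, it treats a stray `{` as an error, and all function names are invented for the example.

```python
import re

# One quantifier, optionally followed by ? for the lazy form.
_QUANTIFIER = re.compile(r"(?:[*+?]|\{\d+(?:,\d+)?\})\??")


def _parse_re(p, i):
    # re ::= simple-re ("|" simple-re)*
    i = _parse_simple(p, i)
    while i < len(p) and p[i] == "|":
        i = _parse_simple(p, i + 1)
    return i


def _parse_simple(p, i):
    # simple-re ::= one or more basic-re (concatenation)
    i = _parse_basic(p, i)
    while i < len(p) and p[i] not in "|)":
        i = _parse_basic(p, i)
    return i


def _parse_basic(p, i):
    # basic-re ::= element | element quantifier (at most ONE quantifier)
    i = _parse_element(p, i)
    m = _QUANTIFIER.match(p, i)
    return m.end() if m else i


def _parse_element(p, i):
    if i >= len(p):
        raise ValueError("expected an element")
    c = p[i]
    if c == "(":
        i += 3 if p.startswith("(?:", i) else 1
        i = _parse_re(p, i)
        if i >= len(p) or p[i] != ")":
            raise ValueError("unbalanced parenthesis")
        return i + 1
    if c == "\\":
        if i + 1 >= len(p):
            raise ValueError("dangling backslash")
        return i + 2
    if c in "*+?{|)":
        # A quantifier where an element is required: the "multiple repeat" case.
        raise ValueError("quantifier with nothing to repeat")
    return i + 1


def accepts(pattern):
    """True if the pattern fits the toy grammar."""
    try:
        return _parse_re(pattern, 0) == len(pattern)
    except ValueError:
        return False


for pattern in (r"\s{6}", r"\s{6}+", r"(?:\s{6})+", r"(\s{6})+", r"\s{6}?"):
    print(pattern, accepts(pattern))
```

In this sketch `\s{6}+` is rejected because, after `\s{6}` has been consumed as an element plus its single quantifier, the parser expects a new element and finds `+` instead. Wrapping the repeat in parentheses turns it back into one element.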

--
Duncan Booth http://kupuguy.blogspot.com

Evan Driscoll

Oct 5, 2012, 11:27:00 AM

to pytho...@python.org

On 10/05/2012 04:23 AM, Duncan Booth wrote:
> A regular expression element may be followed by a quantifier.
> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
> you can follow an element with two quantifiers.

In fact, *you* did -- the first sentence of that paragraph! :-)

\s is a regex, so you can follow it with a quantifier and get \s{6}.
That's also a regex, so you should be able to follow it with a quantifier.

I can understand that you can create a grammar that excludes it. I'm
actually really interested to know if anyone knows whether this was a
deliberate decision and, if so, what the reason is. (And if not --
should it be considered a (low priority) bug?)

Was it because such patterns often reveal a mistake? Because "\s{6}+"
has other meanings in different regex syntaxes and the designers didn't
want confusion? Because it was simpler to parse that way? Because the
"hey you recognize regular expressions by converting it to a finite
automaton" story is a lie in most real-world regex implementations (in
part because they're not actually regular expressions) and repeated
quantifiers cause problems with the parsing techniques that actually get
used?

Evan

Evan Driscoll

Oct 5, 2012, 11:31:26 AM

to pytho...@python.org

On 10/05/2012 10:27 AM, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:

>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.

> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a
> quantifier.

OK, I guess this isn't true... you said a "regular expression *element*"
can be followed by a quantifier. I just took what I usually see as part
of a regular expression and read into your post something it didn't
quite say. Still, the rest of mine applies.

Evan

MRAB

Oct 5, 2012, 12:07:47 PM

to pytho...@python.org

On 2012-10-05 16:27, Evan Driscoll wrote:
> On 10/05/2012 04:23 AM, Duncan Booth wrote:

>> A regular expression element may be followed by a quantifier.
>> Quantifiers are '*', '+', '?', '{n}', '{n,m}' (and lazy quantifiers
>> '*?', '+?', '{n,m}?'). There's nothing in the regex language which says
>> you can follow an element with two quantifiers.

> In fact, *you* did -- the first sentence of that paragraph! :-)
>
> \s is a regex, so you can follow it with a quantifier and get \s{6}.
> That's also a regex, so you should be able to follow it with a quantifier.
>

> I can understand that you can create a grammar that excludes it. I'm
> actually really interested to know if anyone knows whether this was a
> deliberate decision and, if so, what the reason is. (And if not --
> should it be considered a (low priority) bug?)
>
> Was it because such patterns often reveal a mistake? Because "\s{6}+"
> has other meanings in different regex syntaxes and the designers didn't
> want confusion? Because it was simpler to parse that way? Because the
> "hey you recognize regular expressions by converting it to a finite
> automaton" story is a lie in most real-world regex implementations (in
> part because they're not actually regular expressions) and repeated
> quantifiers cause problems with the parsing techniques that actually get
> used?
>

You rarely want to repeat a repeated element. It can also result in
catastrophic backtracking unless you're _very_ careful.

In many other regex implementations (including mine), "*+", "++" and
"?+" are possessive quantifiers, much as "*?", "+?" and "??" are lazy
quantifiers.
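
For readers trying this on a current interpreter: the standard `re` module gained possessive quantifiers in Python 3.11, so `\s{6}+` compiles there and means a possessive `{6}`, while older versions raise the "multiple repeat" error shown at the start of the thread. The sketch below makes no assumption about which version is running; `try_compile` is a made-up helper name.

```python
import re


def try_compile(pattern):
    """Return (compiled_pattern, None) on success or (None, error_message) on failure."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, str(exc)


compiled, problem = try_compile(r"\s{6}+")
if compiled is None:
    # Interpreters without possessive quantifiers end up here.
    print("rejected:", problem)
else:
    # Possessive {6}: exactly six characters are consumed, even with twelve available.
    print("accepted; matched", len(compiled.match(" " * 12).group(0)), "characters")
```

Either way, the pattern never means "a multiple of six"; the grouped form is still needed for that.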

You could, of course, ask why adding "?" after a quantifier doesn't
make it optional, e.g. why r"\s{6}?" doesn't mean the same as
r"(?:\s{6})?", or why r"\s{0,6}?" doesn't mean the same as
r"(?:\s{0,6})?".
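
A short sketch of that distinction, assuming strings of six and three spaces. A `?` directly after a quantifier makes it lazy, whereas a `?` after a group makes the whole group optional.

```python
import re

six = " " * 6
three = " " * 3

# Lazy {0,6}: match as few characters as allowed, which is none.
lazy_range = re.match(r"\s{0,6}?", six).group(0)
# Optional group with a greedy {0,6} inside: takes all six.
optional_range = re.match(r"(?:\s{0,6})?", six).group(0)
print(repr(lazy_range), repr(optional_range))  # '' '      '

# Lazy {6} still needs exactly six characters, so three spaces fail.
lazy_exact = re.match(r"\s{6}?", three)
# Optional group: the group is simply skipped and the match is empty.
optional_exact = re.match(r"(?:\s{6})?", three).group(0)
print(lazy_exact, repr(optional_exact))  # None ''
```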

Cameron Simpson

Oct 5, 2012, 7:37:42 PM

to Evan Driscoll, pytho...@python.org

On 05Oct2012 10:27, Evan Driscoll <dris...@cs.wisc.edu> wrote:
| I can understand that you can create a grammar that excludes it. [...]

| Was it because such patterns often reveal a mistake?

For myself, I would consider that sufficient reason.

I've seen plenty of languages (C and shell, for example, though they
are not alone or egregious) where a compiler can emit a syntax complaint
many lines from the actual coding mistake (in shell, an unclosed quote
or control construct is a common example; Python has the same issue
but mitigated by the indentation requirements which cut the occurrence
down a lot).

Forbidding a common error by requiring a wordier workaround isn't
unreasonable.

| Because "\s{6}+"
| has other meanings in different regex syntaxes and the designers didn't
| want confusion?

I think Python REs are supposed to be Perl compatible; ISTR an opening
sentence to that effect...

| Because it was simpler to parse that way? Because the
| "hey you recognize regular expressions by converting it to a finite
| automaton" story is a lie in most real-world regex implementations (in
| part because they're not actually regular expressions) and repeated
| quantifiers cause problems with the parsing techniques that actually get
| used?

There are certainly constructs that can cause an exponential amount
of backtracking if misused. One could make a case for discouragement
(though not a case for forbidding them).
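
A rough way to see that backtracking cost is to time a nested repeat against input that almost matches. This is only a sketch: the pattern `(?:a+)+$` is a textbook example rather than anything from this thread, `time_failed_match` is a made-up helper, and the timings depend on the machine and on whatever optimisations the engine applies.

```python
import re
import time

# A repeat inside a repeat: many ways to split a run of "a" between the two loops.
NESTED = re.compile(r"(?:a+)+$")


def time_failed_match(n):
    """Time a match attempt on n copies of 'a' followed by 'b', which cannot match."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    return elapsed, result


for n in (10, 14, 18):
    elapsed, result = time_failed_match(n)
    print(n, result, f"{elapsed:.4f}s")
```

On a backtracking engine the time tends to grow very quickly as `n` increases, which is the behaviour being discouraged here.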

Just my 2c,
--
Cameron Simpson <c...@zip.com.au>

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.

Duncan Booth

Oct 9, 2012, 7:29:16 AM

to

Cameron Simpson <c...@zip.com.au> wrote:

>| Because "\s{6}+"
>| has other meanings in different regex syntaxes and the designers didn't
>| want confusion?
>
> I think Python REs are supposed to be Perl compatible; ISTR an opening
> sentence to that effect...
>

I don't know the full history of how regex engines evolved, but I suspect
at least part of the answer is that the decisions the Perl developers made
influenced the other implementations.

Perl's quantifiers allow both '?' and '+' as modifiers on the standard
quantifiers, so clearly you cannot stack those particular quantifiers in
Perl; therefore quantifiers in general are unstackable.

The only grammars I can find online for regular expressions split out the
elements and quantifiers the way I did in my previous post. Python's regex
parser (and I would guess also most of the others in existence) tend more
to the spaghetti code than following a grammar (_parse is a 238 line
function). So I think it really is just trying to match existing regular
expression parsers and any possible grammar is an excuse for why it should
be the way it is rather than an explanation.