[Python-ideas] What about regexp string literals: re".*" ?


Simon D.

Mar 27, 2017, 11:26:07 AM
to python...@python.org
Hello,

After some French discussions about this idea, I subscribed here to
suggest adding a new string literal for regexps, inspired by the other
prefixes like u"", r"", b"", br"", f""…

The regexp string literal could be represented by re"".

It would ease the use of regexps in Python by allowing regexp
literals, like in Perl or JavaScript.

We may end up with an integration like:

>>> import re
>>> if re".k" in 'ok':
... print "ok"
ok
>>>
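For comparison, a rough sketch of the closest spelling available today,
going through the re module explicitly (not the proposed syntax):

    >>> import re
    >>> if re.search(r".k", "ok"):   # explicit function call instead of a literal
    ...     print("ok")
    ...
    ok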

Regexps are part of the language in Perl, and the rather complicated
integration of regexps in other languages, especially in Python, is
something that comes up easily in language-comparison discussions.

I've always felt that JavaScript's integration goes only half the way
it should, and the new string literal prefixes in Python (like f"")
looked like a good compromise to get a tight integration of regexps
without asking to make them part of the language (as I imagine that has
already been discussed years ago, and obviously rejected…).

As the XKCD comic illustrates, using a regexp may be a problem in its
own right, but the "each language has its own new and complicated
approach" issue is another difficulty, on the level of writing the
regexps themselves, I think. And then, once you know the trick for
Python, it still feels like too many characters to type given the
number of problems one can solve with regexps.

I know regexps are slower than string methods (like .startswith), but
regexps can do the most and the least, so they are quick to come up
with once you start thinking in them. Since Python's philosophy is to
spare brain cycles at the cost of CPU cycles, making regexps easy to
use would be a brain-cycle saver.

What do you think?

--
Simon Descarpentries
+336 769 702 53
http://acoeuro.com

Serhiy Storchaka

Mar 27, 2017, 11:40:30 AM
to python...@python.org
On 27.03.17 18:17, Simon D. wrote:
> After some french discussions about this idea, I subscribed here to
> suggest adding a new string litteral, for regexp, inspired by other
> types like : u"", r"", b"", br"", f""…
>
> The regexp string litteral could be represented by : re""
>
> It would ease the use of regexps in Python, allowing to have some regexp
> litterals, like in Perl or JavaScript.

There are several regular expression libraries for Python. One of them
is included in the stdlib, but it is not the first regular expression
library in the stdlib and may not be the last. A particular project can
choose to use an alternative regular expression library (because it has
additional features or is faster for particular cases).
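For instance, since the third-party regex package on PyPI deliberately
mirrors the re API, a project can swap it in with a one-line change (a
sketch, assuming regex is installed):

    import regex as re   # drop-in replacement for the stdlib module

    # the rest of the code keeps calling the familiar API
    if re.search(r".k", "ok"):
        print("ok")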

Steven D'Aprano

Mar 27, 2017, 11:08:27 PM
to python...@python.org
On Mon, Mar 27, 2017 at 05:17:40PM +0200, Simon D. wrote:

> The regexp string litteral could be represented by : re""
>
> It would ease the use of regexps in Python, allowing to have some regexp
> litterals, like in Perl or JavaScript.
>
> We may end up with an integration like :
>
> >>> import re
> >>> if re".k" in 'ok':
> ... print "ok"
> ok

I dislike the suggested syntax re".k". It looks ugly and not different
enough from a raw string. I can easily see people accidentally writing:

if r".k" in 'ok':
...

and wondering why their regex isn't working.


Javascript uses /regex/ as a literal syntax for creating RegExp objects.
That's the closest equivalent to the way Python would have to operate,
although I don't think we can use the /.../ syntax without breaking the
rule that Python's parser will not be more complex than LL(1). So I
think /.../ is definitely out.

Perl 6 uses m/regex/ and a number of other variations:

https://docs.perl6.org/language/regexes


I doubt that this will actually be useful. It *seems* useful if you just
write trivial regexes like your example, but without Perl's rich set of
terse (cryptic?) operators, I don't know that literal regexes
make enough difference to be worth the trouble. There's not very
much difference between (say) these:

mo = re.search(r'.k', mystring)
if mo:
    print(mo.group())

mo = re.'.k'.search(mystring)
if mo:
    print(mo.group())


You effectively save two parentheses, that's all. That doesn't seem like
much of a win for introducing new syntax. Can you show some example code
where a regex literal will have a worthwhile advantage?


> Regexps are part of the language in Perl, and the rather complicated
> integration of regexp in other languages, especially in Python, is
> something that comes up easily in language comparing discussion.

Surely you are joking?

Regex integration in Python is simple. Regular expression objects are
ordinary objects, like lists and dicts and floats. The only difference
is that you don't call the Regex object constructor directly, you either
pass a string to a module level function

re.match(r'my regex', mystring)

or you create a regex object:

regex = re.compile(r'my regex')
regex.match(mystring)
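Put together as a runnable sketch (the strings here are made up for
illustration):

    import re

    mystring = "my regex is here"

    # module-level function: pass the pattern string each time
    if re.match(r'my regex', mystring):
        print("matched via re.match")

    # pre-compiled pattern object: compile once, reuse many times
    regex = re.compile(r'my regex')
    if regex.match(mystring):
        print("matched via regex.match")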


That's very neat, Pythonic and simple. The regex itself is very close to
the same syntax used by Perl, Javascript or other variations; the only
complication is that due to Python's escaping rules you should use a raw
string r'' instead of doubling up all backslashes. I wouldn't call that
"rather complicated" -- it is a lot less complicated than Perl:

- m// can be abbreviated //
- when do you use // directly and when do you use qr// ?
- s/// operator implicitly defines a regex

In Perl 6, I *think* they use rx// instead of qr//, or are they
different things? Both m// and the s/// operator can use arbitrary
delimiters, e.g. ! or , (but not : or parentheses) instead of the
slashes, and m// regexes will implicitly match against $_ if you don't
explicitly match against something else.

Compared to Perl, I don't think Python's regexes are complicated.


--
Steve

Markus Meskanen

Mar 27, 2017, 11:25:20 PM
to Steven D'Aprano, Python-Ideas


On Mar 28, 2017 06:08, "Steven D'Aprano" <st...@pearwood.info> wrote:
On Mon, Mar 27, 2017 at 05:17:40PM +0200, Simon D. wrote:

> The regexp string litteral could be represented by : re""
>
> It would ease the use of regexps in Python, allowing to have some regexp
> litterals, like in Perl or JavaScript.
>
> We may end up with an integration like :
>
> >>> import re
> >>> if re".k" in 'ok':
>     ... print "ok"
>     ok

I dislike the suggested syntax re".k". It looks ugly and not different
enough from a raw string. I can easily see people accidentally writing:

    if r".k" in 'ok':
        ...

and wondering why their regex isn't working.

While I agree with most of your arguments, surely you must be the one joking here? "Ugly" is obviously a matter of opinion; I personally find the proposed syntax more beautiful than the // used in many other languages. But claiming it's bad because people would mix it up with raw strings without realizing it is nonsense. Not only does it look very different, but attempting to call match() or any other regex method on it would surely give a reasonable error:

  AttributeError: 'str' object has no attribute 'match'

which, in the worst-case scenario, leads to some googling where the top-rated StackOverflow question clearly explains the difference between r'' and re''.
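A small sketch of that failure mode (strings made up for illustration):

    import re

    try:
        r".k".match("ok")                  # raw string: no regex methods at all
    except AttributeError as exc:
        print(exc)                         # 'str' object has no attribute 'match'

    print(re.compile(r".k").match("ok"))   # today's equivalent: a real Match object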

Chris Angelico

Mar 28, 2017, 1:38:18 AM
to Python-Ideas
On Tue, Mar 28, 2017 at 2:24 PM, Markus Meskanen
<markusm...@gmail.com> wrote:
> While I agree with most of your arguments, surely you must be the one joking
> here? "Ugly" is obviously a matter of opinion, I personally find the
> proposed syntax more beautiful than the // used in many other languages. But
> claiming it's bad because people would mix it up with raw strings and people
> not realizing is nonsense. Not only does it look very different, but
> attempting to call match() or any other regex method on it would surely give
> out a reasonable error:
>
> AttributeError: 'str' object has no attribute 'match'
>
> Which _in the worst case scenario_ results into googling where the top rated
> StackOverflow question clearly explains the difference between r'' and re''

Yes, but if the "in" operator is used, it would still work, because
r"..." is a str, and "str" in "string" is meaningful.
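A sketch of that silent failure (illustrative only):

    import re

    print(r".k" in "ok")                 # False: plain substring test, no error raised
    print(bool(re.search(r".k", "ok")))  # True: what the regex was meant to do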

But I think a better solution will be for regex literals to be
syntax-highlighted differently. If they're a truly-supported syntactic
feature, they can be made visually different in your editor, making
the distinction blatantly obvious.

That said, though, I'm -1 on this. Currently, every prefix letter has
its own meaning, and broadly speaking, combining them combines their
meanings. An re"..." literal should be a raw "e-string", whatever that
is, so I would expect that e"..." is the same kind of thing but with
different backslash handling.

ChrisA

Markus Meskanen

Mar 28, 2017, 1:46:17 AM
to Chris Angelico, Python-Ideas
On Tue, Mar 28, 2017 at 8:37 AM, Chris Angelico <ros...@gmail.com> wrote:
Yes, but if the "in" operator is used, it would still work, because
r"..." is a str, and "str" in "string" is meaningful.

But I think a better solution will be for regex literals to be
syntax-highlighted differently. If they're a truly-supported syntactic
feature, they can be made visually different in your editor, making
the distinction blatantly obvious.

That said, though, I'm -1 on this. Currently, every prefix letter has
its own meaning, and broadly speaking, combining them combines their
meanings. An re"..." literal should be a raw "e-string", whatever that
is, so I would expect that e"..." is the same kind of thing but with
different backslash handling.

Fair enough, I haven't followed this thread too closely and didn't consider the "in" operator being used. Even then I find it unlikely that confusing re'...' with r'...' and not noticing would turn out to be an issue.

That being said, I'm also -1 on this, especially now after your point on "e-strings". Adding these re-strings would outright prevent e-strings from ever being implemented.

Simon D.

Mar 28, 2017, 3:55:31 AM
to python...@python.org
* Serhiy Storchaka <stor...@gmail.com> [2017-03-27 18:39:19 +0300]:
> There are several regular expression libraries for Python. One of them is
> included in the stdlib, but this is not the first regular expression library
> in the stdlib and may be not the last. Particular project can choose using
> an alternative regular expression library (because it has additional
> features or is faster for particular cases).
>

I believe that the u"" notation in Python 2.7 is defined while
importing the unicode_literals module.

Each regexp lib could provide its own instantiation of the regexp
literal notation.

And if only the default one does, it would still be a win for
beginners, and for the majority of people using the stdlib.

--
Simon Descarpentries
+336 769 702 53
http://s.d12s.fr

Simon D.

Mar 28, 2017, 3:56:59 AM
to python...@python.org
* Chris Angelico <ros...@gmail.com> [2017-03-28 16:37:16 +1100]:

> But I think a better solution will be for regex literals to be
> syntax-highlighted differently. If they're a truly-supported syntactic
> feature, they can be made visually different in your editor, making
> the distinction blatantly obvious.
>
> That said, though, I'm -1 on this. Currently, every prefix letter has
> its own meaning, and broadly speaking, combining them combines their
> meanings. An re"..." literal should be a raw "e-string", whatever that
> is, so I would expect that e"..." is the same kind of thing but with
> different backslash handling.

First, I would like to state that the "module-static" versions of the
regexp functions, which avoid the compile step, are a great idea.
(e.g.: mo = re.search(r'.k', mystring))
The str-integrated ones too, but maybe confusing: which regexp lib is
used? (It must be the default one.)

Then, re"" being two letters looks like a real problem. Let's pick one
among the 22 remaining free letters of the alphabet. What about:
- g"", x"" (like in regex)?
- m"" (like shown for Perl, meaning Match?)
- q"" (for Query?)
- k"" (in memory of Stephen Cole Kleene?
https://en.wikipedia.org/wiki/Regular_expression)
- /"" (to go halfway toward the /regexp/ syntax)
- ~"", ?"" (other symbols; I avoid symbols that start regexps, which
would be ugly in real usage)

And what about an approach with the flags first? (Or where to put them?):
i"" (regexp with the ignorecase flag on)
AILMSX"" (regexp with all flags on)

It would consume a lot of letters, but for a good reason :-)
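For reference, today the flags go in as arguments or as inline groups
in the pattern, rather than into a literal prefix; a small sketch:

    import re

    # flags as arguments to compile()/search()
    pat = re.compile(r".k", re.IGNORECASE)
    print(bool(pat.search("OK")))            # True

    # or as inline flag groups inside the pattern itself
    print(bool(re.search(r"(?i).k", "OK")))  # True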

Personally, I think a JavaScript-like syntax would be great, though I
feel it is asking too much… :
- it would naturally be highlighted differently;
- it would not be the first (happy) similarity
(https://hackernoon.com/javascript-vs-python-in-2017-d31efbb641b4#.ky9it5hph)
- it is a working integration, including flag handling.

--
Simon Descarpentries
+336 769 702 53

http://s.d12s.fr

Paul Moore

Mar 28, 2017, 4:32:00 AM
to Simon D., Python-Ideas
On 28 March 2017 at 08:54, Simon D. <si...@acoeuro.com> wrote:
> I believe that the u"" notation in Python 2.7 is defined by while
> importing the unicode_litterals module.

That's not true. The u"..." syntax is part of the language. from
__future__ import unicode_literals is something completely different.
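The distinction, roughly (a sketch of Python 2.7 behaviour):

    # Python 2.7
    from __future__ import unicode_literals  # changes what bare "..." literals mean

    s = "text"    # now a unicode object because of the import above
    u = u"text"   # the u"..." prefix itself is language syntax, import or not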

> Each regexp lib could provide its instanciation of regexp litteral
> notation.

The Python language has no way of doing that - user (or library)
defined literals are not possible.

> And if only the default one does, it would still be won for the
> beginers, and the majority of persons using the stdlib.

How? You've yet to prove that having a regex literal form is an
improvement over re.compile(r'put your regex here'). You've asserted
it, but that's a matter of opinion. We'd need evidence of real-life
code that was clearly improved by the existence of your proposed
construct.

Paul

Abe Dillon

Mar 29, 2017, 4:31:19 PM
to Paul Moore, Python-Ideas
My 2 cents is that regular expressions are pretty un-pythonic because of their horrible readability. I would much rather see Python adopt something like Verbal Expressions ( https://github.com/VerbalExpressions/PythonVerbalExpressions ) into the standard library than add special syntax support for normal REs.

Markus Meskanen

Mar 29, 2017, 5:00:10 PM
to Abe Dillon, Python-Ideas


On Mar 29, 2017 23:31, "Abe Dillon" <abed...@gmail.com> wrote:
My 2 cents is that regular expressions are pretty un-pythonic because of their horrible readability. I would much rather see Python adopt something like Verbal Expressions ( https://github.com/VerbalExpressions/PythonVerbalExpressions ) into the standard library than add special syntax support for normal REs.

I've never heard of this before, looks *awesome*. Thanks, if it's as good as it sounds, I too would love something like this added to the standard library.

Ryan Gonzalez

Mar 29, 2017, 9:26:20 PM
to Abe Dillon, python-ideas
I feel like that borders on a bit too wordy...

Personally, I'd like to see something like Felix's regular definitions:




--
Ryan (ライアン)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://refi64.com

Abe Dillon

Mar 29, 2017, 9:48:15 PM
to Ryan Gonzalez, python-ideas
I feel like that borders on a bit too wordy...

I think the use of words instead of symbols is one of the things that makes Python so readable. The ternary operator is done with words:

value = option1 if condition else option2 


reads almost like English, while:

value = condition ? option1: option2;


Is just weird.

I can read Verbal Expressions very quickly and understand exactly what's going on. If I have a decent IDE, I can write them almost as easily. I see no problem with wordiness if it means I don't have to stare at the code and scratch my head longer, or worse, open a reference to help me translate it (which is invariably the case when I look at regular expressions).

Chris Angelico

Mar 29, 2017, 10:00:48 PM
to python-ideas
On Thu, Mar 30, 2017 at 12:47 PM, Abe Dillon <abed...@gmail.com> wrote:
>> I feel like that borders on a bit too wordy...
>
>
> I think the use of words instead of symbols is one of the things that makes
> Python so readable. The ternary operator is done with words:
>
> value = option1 if condition else option2
>
> reads almost like English, while:
>
> value = condition ? option1: option2;
>
> Is just weird.
>
> I can read Verbal Expressions very quickly and understand exactly what's
> going on. If I have a decent IDE, I can write them almost as easily. I see
> no problem with wordiness if it means I don't have to stare at the code and
> scratch my head longer, or worse, open a reference to help me translate it
> (which is invariably the case when I look at regular expressions).

However, a huge advantage of REs is that they are common to many
languages. You can take a regex from grep to Perl to your editor to
Python. They're not absolutely identical, of course, but the basics
are all the same. Creating a new search language means everyone has to
learn anew.

ChrisA

Stephen J. Turnbull

Mar 29, 2017, 11:57:31 PM
to Abe Dillon, Python-Ideas
Abe Dillon writes:

> My 2 cents is that regular expressions are pretty un-pythonic because of
> their horrible readability. I would much rather see Python adopt something
> like Verbal Expressions (
> https://github.com/VerbalExpressions/PythonVerbalExpressions ) into the
> standard library than add special syntax support for normal REs.

You think that example is more readable than the proposed translation

^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$

which is better written

^https?://(www\.)?[^ ]*$

or even

^https?://[^ ]*$

which makes it obvious that the regexp is not very useful from the
word "^"? (It matches only URLs which are the only thing, including
whitespace, on the line, probably not what was intended.)

Are those groups capturing in Verbal Expressions? The use of "find"
(~ "search") rather than "match" is disconcerting to the experienced
user. What does alternation look like? How about alternation of
non-trivial regular expressions? Etc, etc.

As far as I can see, Verbal Expressions are basically a way of making
it so painful to write regular expressions that people will restrict
themselves to regular expressions that would be quite readable in
traditional notation! I don't think that this failure to respect the
developer's taste is restricted to this particular implementation,
either. They *are* regular expressions, just with a verbose,
obstructive notation.

Far more important than "more readable" regular expressions would be a
parsing library in the stdlib, reducing the developer's temptation to
parse using complex regular expressions. IMHO YMMV etc.

Steve

Nick Coghlan

Mar 30, 2017, 12:50:34 AM
to Simon D., python...@python.org
On 28 March 2017 at 01:17, Simon D. <si...@acoeuro.com> wrote:
> It would ease the use of regexps in Python

We don't really want to ease the use of regexps in Python - while
they're an incredibly useful tool in a programmer's toolkit, they're
so cryptic that they're almost inevitably a maintainability nightmare.

Baking them directly into the language runtime would also lock people
into a particular regex engine implementation, rather than being able
to swap in a third-party one if they choose to do so (as many folks
currently do with the `regex` PyPI module).

So it's appropriate to keep them as a string-based library level
capability, and hence on a relatively level playing field with less
comprehensive, but typically easier to maintain, options like string
methods and third party text parsing libraries (such as
https://pypi.python.org/pypi/parse for something close to the inverse
of str.format).
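For example, the parse package inverts a format-style template roughly
like this (a sketch, assuming the package is installed):

    from parse import parse

    result = parse("{scheme}://{host}/{path}", "https://example.com/docs")
    print(result["scheme"], result["host"], result["path"])
    # https example.com docs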

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Simon D.

Mar 30, 2017, 2:39:35 AM
to python...@python.org
* Simon D. <si...@acoeuro.com> [2017-03-28 09:56:05 +0200]:

> The str integrated one also, but maybe confusing, which regexp lib is
> used ? (must be the default one).
>

OK, this was a mistake, based on JavaScript memories… There are no
regexp-aware functions on str, just a hint pointing you to the re
module to go find your happiness.

--
Simon Descarpentries
+336 769 702 53

http://acoeuro.com

Abe Dillon

Mar 30, 2017, 9:39:14 PM
to turnbull....@u.tsukuba.ac.jp, python...@python.org

a huge advantage of REs is that they are common to many
languages. You can take a regex from grep to Perl to your editor to
Python. They're not absolutely identical, of course, but the basics
are all the same. Creating a new search language means everyone has to
learn anew.
ChrisA

1) I'm not suggesting we get rid of the re module (the VE implementation I linked requires it)
2) You can easily output regex from verbal expressions
3) verbal expressions are implemented in many different languages too: https://verbalexpressions.github.io/
4) It even has a generic interface that all implementations are meant to follow: https://github.com/VerbalExpressions/implementation/wiki/List-of-methods-to-implement

Note that the entire documentation is 250 words while just the syntax portion of Python docs for the re module is over 3000 words.
 
You think that example is more readable than the proposed transalation
    ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
which is better written
    ^https?://(www\.)?[^ ]*$
or even
    ^https?://[^ ]*$

Yes. I find it far more readable. It's not a soup of symbols like Perl code. I can only surmise that you're fluent in regex because it seems difficult for you to see how the above could be less readable than English words.

which makes it obvious that the regexp is not very useful from the
word "^"?  (It matches only URLs which are the only thing, including
whitespace, on the line, probably not what was intended.)

I could tell it only matches URLs that are the only thing inside the string because it clearly says start_of_line() and end_of_line(). I would have had to refer to a reference to know that "^" doesn't always mean "not"; it sometimes means "start of string" (and probably other things). I would also have to check a reference to know that "$" can mean "end of string" (and probably other things).

Are those groups capturing in Verbal Expressions?  The use of "find"
(~ "search") rather than "match" is disconcerting to the experienced
user.

You can alternatively use the word "then". The source code is just one Python file. It's very easy to read. I actually like "then" better than "find" for the example:

(verbal_expression.start_of_line()
    .then('http')
    .maybe('s')
    .then('://')
    .maybe('www.')
    .anything_but(' ')
    .end_of_line())

What does alternation look like?

.OR(option1).OR(option2).OR(option3)...

How about alternation of
non-trivial regular expressions?

.OR(other_verbal_expression)

As far as I can see, Verbal Expressions are basically a way of making
it so painful to write regular expressions that people will restrict
themselves to regular expressions

What's so painful about writing them? Does your IDE not have autocompletion? I find REs so painful to write that I usually just use string methods if at all feasible.


I don't think that this failure to respect the
developer's taste is restricted to this particular implementation,
either.

I generally find it distasteful to write a pseudolanguage in strings inside of another language (this applies to SQL as well), especially when the design principles of that pseudolanguage are diametrically opposed to the design principles of the host language. A key principle of Python's design is: "you read code a lot more often than you write code, so emphasize readability". Regex seems to be based on: "Do the most with the fewest keystrokes. Readability be damned!". It makes a lot more sense to wrap the pseudolanguage in constructs that bring it in line with the host language than to take on the mental burden of trying to comprehend two different languages at the same time.

If you disagree, nothing's stopping you from continuing to write REs the old-fashioned way. Can we at least agree that baking special re syntax directly into the language is a bad idea?

Stephen J. Turnbull

Mar 31, 2017, 3:24:52 AM
to Abe Dillon, python...@python.org
Abe Dillon writes:

> Note that the entire documentation is 250 words while just the syntax
> portion of Python docs for the re module is over 3000 words.

Since Verbal Expressions (below, VEs, indicating notation) "compile"
to regular expressions (spelling out indicates the internal matching
implementation), the documentation of VEs presumably ignores
everything except the limited language it's useful for. To actually
understand VEs, you need to refer to the RE docs. Not a win IMO.

> > You think that example is more readable than the proposed transalation
> > ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
> > which is better written
> > ^https?://(www\.)?[^ ]*$
> > or even
> > ^https?://[^ ]*$
>
>
> Yes. I find it *far* more readable. It's not a soup of symbols like Perl
> code. I can only surmise that you're fluent in regex because it seems
> difficult for you to see how the above could be less readable than English
> words.

Yes, I'm fairly fluent in regular expression notation (below, REs).
I've maintained a compiler for one dialect.

I'm not interested in the difference between words and punctuation
though. The reason I find the middle RE most readable is that it
"looks like" what it's supposed to match, in a contiguous string as
the object it will match will be contiguous. If I need to parse it to
figure out *exactly* what it matches, yes, that takes more effort.
But to understand a VE's semantics correctly, I'd have to look it up
as often as you have to look up REs because many words chosen to notate
VEs have English meanings that are (a) ambiguous, as in all natural
language, and (b) only approximate matches to RE semantics.

> I could tell it only matches URLs that are the only thing inside
> the string because it clearly says: start_of_line() and
> end_of_line().

That's not the problem. The problem is the semantics of the method
"find". "then" would indeed read better, although it doesn't exactly
match the semantics of concatenation in REs.

> I would have had to refer to a reference to know that "^" doesn't
> always mean "not", it sometimes means "start of string" and
> probably other things. I would also have to check a reference to
> know that "$" can mean "end of string" (and probably other things).

And you'll still have to do that when reading other people's REs.

> > Are those groups capturing in Verbal Expressions? The use of
> > "find" (~ "search") rather than "match" is disconcerting to the
> > experienced user.
>
> You can alternately use the word "then". The source code is just
> one python file. It's very easy to read. I actually like "then"
> over "find" for the example:

You're missing the point. The reader does not get to choose the
notation, the author does. I do understand what several varieties of
RE mean, but the variations are of two kinds: basic versus extended
(ie, what tokens need to be escaped to be taken literally, which ones
have special meaning if escaped), and extensions (which can be
ignored). Modern RE facilities are essentially all of the extended
variety. Once you've learned that, you're in good shape for almost
any RE that should be written outside of an obfuscated code contest.

This is a fundamental principle of Python design: don't make readers
of code learn new things. That includes using notation developed
elsewhere in many cases.

> What does alternation look like?
>
> .OR(option1).OR(option2).OR(option3)...
>
> How about alternation of
> > non-trivial regular expressions?
>
> .OR(other_verbal_expression)

Real examples, rather than pseudo code, would be nice. I think you,
too, will find that examples of even fairly simple nested alternations
containing other constructs become quite hard to read, as they fall
off the bottom of the screen.

For example, the VE equivalent of

scheme = "(https?|ftp|file):"

would be (AFAICT):

scheme = VerEx().then(VerEx().then("http")
                             .maybe("s")
                             .OR("ftp")
                             .OR("file"))
                .then(":")

which is pretty hideous, I think. And the colon is captured by a
group. If perversely I wanted to extract that group from a match,
what would its index be?

I guess you could keep the linear arrangement with

scheme = (VerEx().add("(")
                 .then("http")
                 .maybe("s")
                 .OR("ftp")
                 .OR("file")
                 .add(")")
                 .then(":"))

but is that really an improvement over

scheme = VerEx().add("(https?|ftp|file):")

;-)

> > As far as I can see, Verbal Expressions are basically a way of
> > making it so painful to write regular expressions that people
> > will restrict themselves to regular expressions
>
> What's so painful to write about them?

One thing that's painful is that VEs "look like" context-free
grammars, but clumsy and without the powerful semantics. You can get
the readability you want with greater power using grammars, which is
why I would prefer we work on getting a parser module into the stdlib.

But if one doesn't know about grammars, it's still not great. The
main pains about writing VEs for me are (1) reading what I just wrote,
(2) accessing capturing groups, and (3) verbosity. Even a VE to
accurately match what is normally a fairly short string, such as the
scheme, credentials, authority, and port portions of a "standard" URL,
is going to be hundreds of characters long and likely dozens of lines
if folded as in the examples.

Another issue is that we already have a perfectly good poor man's
matching library: glob. The URL example becomes

http{,s}://{,www.}*

Granted you lose the anchors, but how often does that matter? You
apparently don't use them often enough to remember them.

> Does your IDE not have autocompletion?

I don't want an IDE. I have Emacs.

> I find REs so painful to write that I usually just use string
> methods if at all feasible.

Guess what? That's the right thing to do anyway. They're a lot more
readable and efficient when partitioning a string into two or three
parts, or recognizing a short list of affixes. But chaining many
methods, as VEs do, is not a very Pythonic way to write a program.

> > I don't think that this failure to respect the developer's taste
> > is restricted to this particular implementation, either.
>
> I generally find it distasteful to write a pseudolanguage in
> strings inside of other languages (this applies to SQL as well).

You mean like arithmetic operators? (Lisp does this right, right?
Only one kind of expression, the function call!) It's a matter of
what you're used to. I understand that people new to text-processing,
or who don't do so much of it, don't find REs easy to read. So how is
this a huge loss? They don't use regular expressions very often! In
fact, they're far more likely to encounter, and possibly need to
understand, REs written by others!

> Especially when the design principals of that pseudolanguage are
> *diametrically opposed* to the design principals of the host
> language. A key principal of Python's design is: "you read code a
> lot more often than you write code, so emphasize
> readability". Regex seems to be based on: "Do the most with the
> fewest key-strokes.

So is all of mathematics. There's nothing wrong with concise
expression for use in special cases.

> Readability be dammed!". It makes a lot more sense to wrap the
> psudolanguage in constructs that bring it in-line with the host
> language than to take on the mental burden of trying to comprehend
> two different languages at the same time.
>
> If you disagree, nothing's stopping you from continuing to write
> res the old-fashion way.

I don't think that RE and SQL are "pseudo" languages, no. And I, and
most developers, will continue to write regular expressions using the
much more compact and expressive RE notation. (In fact, with the
exception of the "word" method, in VEs you still need to use RE notation
to express most of the Python extensions.) So what you're saying is
that you don't read much code, except maybe your own. Isn't that your
problem? Those of us who cooperate widely on applications using
regular expressions will continue to communicate using REs. If that
leaves you out, that's not good. But adding VEs to the stdlib (and
thus encouraging their use) will split the community into RE users and
VE users, if VEs are at all useful. That's bad. I don't see the
potential usefulness of VEs to infrequent users of regular expressions
outweighing the downsides of "many ways to do it" in the stdlib.

> Can we at least agree that baking special re syntax directly into
> the language is a bad idea?

I agree that there's no particular need for RE literals. If one wants
to mark an RE as some special kind of object, re.compile() does that
very well both by converting to a different type internally and as a
marker syntactically.

> On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <ncog...@gmail.com> wrote:
>
> > We don't really want to ease the use of regexps in Python - while
> > they're an incredibly useful tool in a programmer's toolkit,
> > they're so cryptic that they're almost inevitably a
> > maintainability nightmare.

I agree with Nick. Regular expressions, whatever the notation, are a
useful tool (no suspension of disbelief necessary for me, though!).
But they are cryptic, and it's not just the notation. People (even
experienced RE users) are often surprised by what fairly simple
regular expression match in a given text, because people want to read
a regexp as instructions to a one-pass greedy parser, and it isn't.

For example, above I wrote

scheme = "(https?|ftp|file):"

rather than

scheme = "(\w+):"

because it's not unlikely that I would want to treat those differently
from other schemes such as mailto, news, and doi. In many
applications of regular expressions (such as tokenization for a
parser) you need many expressions. Compactness really is a virtue in
REs.

Steve

Stephan Houben

Mar 31, 2017, 4:21:51 AM
to Stephen J. Turnbull, python...@python.org
Hi all,

FWIW, I also strongly prefer the Verbal Expression style and consider
"normal" regular expressions to become quickly unreadable and
unmaintainable.

Verbal Expressions are also much more composable.

Stephan

2017-03-31 9:23 GMT+02:00 Stephen J. Turnbull
<turnbull....@u.tsukuba.ac.jp>:

Paul Moore

Mar 31, 2017, 4:27:39 AM
to Stephan Houben, python...@python.org
On 31 March 2017 at 09:20, Stephan Houben <steph...@gmail.com> wrote:
> FWIW, I also strongly prefer the Verbal Expression style and consider
> "normal" regular expressions to become quickly unreadable and
> unmaintainable.

Do you publish your code widely? What's the view of 3rd party users of
your code? Until this thread, I'd never even heard of the Verbal
Expression style, and I read a *lot* of open source Python code. While
it's purely anecdotal, that suggests to me that the style isn't
particularly commonly used.

(OTOH, there's also a lot less use of REs in Python code than in other
languages. Much string manipulation in Python avoids using regular
languages at all, in my experience. I think that's a good thing - use
simpler tools when appropriate and keep the power tools for the hard
cases where they justify their complexity).

Paul

Stephen J. Turnbull

Apr 2, 2017, 3:30:37 PM
to Stephan Houben, python...@python.org
Stephan Houben writes:

> FWIW, I also strongly prefer the Verbal Expression style and consider
> "normal" regular expressions to become quickly unreadable and
> unmaintainable.
>
> Verbal Expressions are also much more composable.

So are grammars.

But REs aren't so bad or incomposable if you build them up slowly in a
grammar-like fashion and with a specific convention for groups:

atom = r"[-%A-Za-z0-9]+" # incorrect, for example only
# each component has different lexical
# restrictions
scheme = user = password = rf"({atom})"
domain = rf"((?:{atom}\.)+{atom})"
port = r"([0-9]+)"
authority = rf"(?:{user}(?::{password})?@)?{domain}(?::{port})?"
path = rf"((?:/{atom})+/?)"

# Incorrect, but handles many common URIs.
url = rf"{scheme}://(?:{authority})?({path})"

Of course this is parsing with regular expressions, which is generally
frowned upon, and it would be even uglier without f-strings. The
non-capturing groups admittedly are a significant distraction when
reading. It's about the limit of what I would do if I didn't have a
parsing library but did have REs (more complex than this and I'd write
my own parser).

I will concede that it took me 15 minutes to write that, of which 4
were spent testing and fixing one bug (which was a real bug; there
were no syntax errors in the REs). Some of the time was spent
deciding how closely to follow the RFC 3986 generic syntax, though. I
will also concede that I've been writing REs since 1981, although not
as frequently in the last 15 years as in the first 20.

Would you like to write that using VEs and show us the result? Don't
forget to document the indices for extracting the scheme, user,
password, domain, port, and path (in my RE, they are 1-6).
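To make the group numbering concrete, a quick check against a made-up
URL, reusing the definitions above:

    import re

    # assumes `url` has been built as in the snippet above
    m = re.match(url, "https://alice:secret@www.example.com:8080/a/b/")
    print(m.group(1, 2, 3, 4, 5, 6))
    # ('https', 'alice', 'secret', 'www.example.com', '8080', '/a/b/')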

Neil Girdhar

Apr 2, 2017, 9:22:08 PM
to python-ideas, turnbull....@u.tsukuba.ac.jp, python...@python.org, steph...@gmail.com
Same.  One day, Python will have a decent parsing library.

Mark Lawrence via Python-ideas

Apr 3, 2017, 2:31:02 AM
to python...@python.org
On 03/04/2017 02:22, Neil Girdhar wrote:
> Same. One day, Python will have a decent parsing library.
>

Nothing here https://wiki.python.org/moin/LanguageParsing suits your needs?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Neil Girdhar

Apr 3, 2017, 4:26:08 AM
to python...@googlegroups.com, python...@python.org
On Mon, Apr 3, 2017 at 2:31 AM Mark Lawrence via Python-ideas <python...@python.org> wrote:
On 03/04/2017 02:22, Neil Girdhar wrote:
> Same.  One day, Python will have a decent parsing library.
>

Nothing here https://wiki.python.org/moin/LanguageParsing suits your needs?

No, unfortunately.

I tried to make a simple grammar that parses LaTeX code, and it was basically impossible with these tools.

From what I remember, you need the match objects to be able to accept or reject their matched sub-nodes.

It's the same thing if you want to parse Python in one pass (not the usual two passes that CPython does, whereby it creates an AST and then validates it). It would be cooler to validate as you go, since the errors can be much richer when you have the whole parsing context.

It's been a while, so I might be forgetting something, but I remember thinking that I'll check back in five years and see if anything new has come out.



Ryan Gonzalez

Apr 3, 2017, 8:55:33 AM
to Neil Girdhar, python-ideas, python-ideas
Have you tried PyParsing and/or Grako? They're some of my favorites (well, I like PLY too, but I'm thinking you wouldn't like it too much).

--
Ryan (ライアン)
Yoko Shimomura > ryo (supercell/EGOIST) > Hiroyuki Sawano >> everyone else
http://refi64.com

Neil Girdhar

Apr 3, 2017, 8:57:43 AM
to Ryan Gonzalez, python-ideas, python-ideas
I've tried PyParsing.  I haven't tried Grako.


Juancarlo Añez

Apr 3, 2017, 1:08:17 PM
to Neil Girdhar, python-ideas, python-ideas

On Mon, Apr 3, 2017 at 8:57 AM, Neil Girdhar <miste...@gmail.com> wrote:
I've tried PyParsing.  I haven't tried Grako.

Caveat: I'm the author of Grako.

It's very easy to do complex parsing with Grako. The grammar can be embedded in a Python string, and the compiled grammar can be used for parsing without generating any Python code. Most of the unit tests under the distribution's grako/grako/test use those features.


One of the ways in which a top-down grammar (such as those accepted by Grako) can be used is to organize a series of regular expressions into a tree, to handle complex cases with clarity.


--
Juancarlo Añez