should we change [^a-z] to <-[a..z]> instead of <-[a-z]>?


Larry Wall

Apr 14, 2005, 8:21:05 PM
to perl6-l...@perl.org
In writing some character class translation, I realized that

<-[a-z]>

and its ilk are rather hard to read because of the two hyphens
that mean different things. We can't use <![a-z]> because that's a
0-width lookahead. Given that we're trying to get rid of special
exceptions, and - in character classes is weird, and we already
use .. for ranges everywhere else, and nobody is going to put a
repeated character into a character class, I'm wondering if

<-[a..z]>

should be allowed/encouraged/required. It greatly improves the
readability in my estimation. The only problem with requiring .. is
that people *will* write <[a-z]> out of habit, and we would probably
have to outlaw the - form for many years before everyone would get
used to the .. form. So maybe we allow - but warn if not backslashed.

Larry

Darren Duncan

Apr 14, 2005, 9:40:00 PM
to perl6-l...@perl.org

I don't see why the old syntax has to be supported at all.

Lots of other regexp details are already being changed, such as the
bounding '<>' and the removal of the leading internal '^', so people
already have to edit their regexps. So they can replace the '-' too
while they're at it; not very difficult.

Moreover, I often create character classes that contain a literal '-',
and it would be nice not to have to make that the last character
in the class for it to parse properly.

Also, the '..' is easy to learn because it is consistent with other
parts of Perl 6. Likewise, the consistency is another plus when
demonstrating what is good about Perl to folk who don't use it.

-- Darren Duncan

Patrick R. Michaud

Apr 14, 2005, 10:06:00 PM
to perl6-l...@perl.org
On Thu, Apr 14, 2005 at 05:21:05PM -0700, Larry Wall wrote:
> Given that we're trying to get rid of special
> exceptions, and - in character classes is weird, and we already
> use .. for ranges everywhere else, and nobody is going to put a
> repeated character into a character class, I'm wondering if
>
> <-[a..z]>
>
> should be allowed/encouraged/required. It greatly improves the
> readability in my estimation.

So, <[a.z]> matches "a", ".", and "z",
while <[a..z]> matches characters "a" through "z" inclusive.

I think that works for me. I'll implement it that way (and yes, there
*are* updates to PGE coming very soon!).
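
As a rough illustration of the two cases above, here is a small Python sketch. It is not PGE code: the function name expand_class is made up for this example, and it models only two rules (x..y is an inclusive range, every other character stands for itself), ignoring escapes and negation.

```python
def expand_class(body):
    """Expand the body of a proposed <[...]> class into a set of characters.

    Hypothetical helper: 'x..y' is an inclusive range, and any other
    character (including a single '.' or '-') stands for itself.
    """
    chars = set()
    i = 0
    while i < len(body):
        # A range needs a start character, two dots, and an end character.
        if body[i + 1:i + 3] == ".." and i + 3 < len(body):
            lo, hi = ord(body[i]), ord(body[i + 3])
            chars.update(chr(c) for c in range(lo, hi + 1))
            i += 4
        else:
            chars.add(body[i])
            i += 1
    return chars


print(sorted(expand_class("a.z")))   # the three literal characters
print(len(expand_class("a..z")))     # the 26 lowercase letters
```

Under these two rules alone, a body such as a-z is simply three literal characters; whether that should also produce a warning is a separate question.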

I guess I can't complain too loudly about ".." over "-" for ranges
since I was the one who suggested replacing "," with ".." in quantifiers
(e.g., {1..3} instead of {1,3}). Not that I'd be complaining anyway. :-)

> The only problem with requiring .. is
> that people *will* write <[a-z]> out of habit, and we would probably
> have to outlaw the - form for many years before everyone would get
> used to the .. form. So maybe we allow - but warn if not backslashed.

Just to make sure I have it right, by "allow -" you mean that
<[a-z]> matches "a", "-", and "z" and produces a warning
about an unescaped '-'?

Pm

David Wheeler

Apr 15, 2005, 12:32:19 AM
to Patrick R. Michaud, perl6-l...@perl.org
On Apr 14, 2005, at 7:06 PM, Patrick R. Michaud wrote:

> So, <[a.z]> matches "a", ".", and "z",
> while <[a..z]> matches characters "a" through "z" inclusive.

I was going to say that that was inconsistent, but since you never need
to repeat a letter in a character class, well, I guess it isn't. But
the first person to write <[a...]> gets what's comin' to 'em.

Regards,

David

--
David Wheeler
President, Kineticode, Inc.
http://www.kineticode.com/
Kineticode. Setting knowledge in motion.[sm]

Juerd

Apr 15, 2005, 6:18:12 AM
to David Wheeler, Patrick R. Michaud, perl6-l...@perl.org
David Wheeler skribis 2005-04-14 21:32 (-0700):

> I was going to say that that was inconsistent, but since you never need
> to repeat a letter in a character class, well, I guess it isn't. But
> the first person to write <[a...]> gets what's comin' to 'em.

Given ASCII, <[\x20...]> would then be everything except control
characters. Handy!

By the way, does ...5 mean -Inf..5? ;)


Juerd
--
http://convolution.nl/maak_juerd_blij.html
http://convolution.nl/make_juerd_happy.html
http://convolution.nl/gajigu_juerd_n.html

Braňo Tichý

Apr 15, 2005, 8:33:54 AM
to Aaron Sherman, perl6-l...@perl.org
----- Original Message -----
From: "Aaron Sherman" <a...@ajs.com>
To: "David Wheeler" <da...@kineticode.com>
Cc: "Perl6 Language List" <perl6-l...@perl.org>
Sent: Friday, April 15, 2005 2:00 PM
Subject: Re: should we change [^a-z] to <-[a..z]> instead of <-[a-z]>?


> On Thu, 2005-04-14 at 21:32 -0700, David Wheeler wrote:
> > On Apr 14, 2005, at 7:06 PM, Patrick R. Michaud wrote:
> >
> > > So, <[a.z]> matches "a", ".", and "z",
> > > while <[a..z]> matches characters "a" through "z" inclusive.
> >
> > I was going to say that that was inconsistent, but since you never need
> > to repeat a letter in a character class, well, I guess it isn't. But
> > the first person to write <[a...]> gets what's comin' to 'em.
>

> A silly question: is there a canonical character set from which we
> extract these ranges? Are we hard-coding Unicode here, or is there some
> way for the user to specify the character set for ranges?
>

<delurk>
even sillier question:
if <[a.z]> matches "a", "." and "z"
and <[a...]> matches all characters from "a" including (for some definition
of 'all')

how will the range \x21 .. \x2e be written?
<[!..\.]>? (i.e. "." escaped?)
</delurk>

braňo

Aaron Sherman

Apr 15, 2005, 8:00:31 AM
to David Wheeler, Perl6 Language List
On Thu, 2005-04-14 at 21:32 -0700, David Wheeler wrote:
> On Apr 14, 2005, at 7:06 PM, Patrick R. Michaud wrote:
>
> > So, <[a.z]> matches "a", ".", and "z",
> > while <[a..z]> matches characters "a" through "z" inclusive.
>
> I was going to say that that was inconsistent, but since you never need
> to repeat a letter in a character class, well, I guess it isn't. But
> the first person to write <[a...]> gets what's comin' to 'em.

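A silly question: is there a canonical character set from which we
extract these ranges? Are we hard-coding Unicode here, or is there some
way for the user to specify the character set for ranges?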

Matthew Walton

Apr 15, 2005, 9:20:48 AM
to ti...@dss.sk, a...@ajs.com, perl6-l...@perl.org

> <delurk>
> even sillier question:
> if <[a.z]> matches "a", "." and "z"
> and <[a...]> matches all characters from "a" including (for some
> definition of 'all')
>
> how will the range \x21 .. \x2e be written?
> <[!..\.]>? (i.e. "." escaped?)
> </delurk>

I was assuming from Larry's mail that <[a...]> would parse as either:

1) a character class containing the range from 'a' to '.' (what that
means is a bit mind-bending for a Friday afternoon)

2) a character class containing 'a' then a range from '.' to... oh, an
error

Which way might be ambiguous, but could of course be defined in the
grammar. It hadn't occurred to me that ... for the range to infinity would
be allowed or useful here. I suppose it could just mean 'up to the end of
the available codepoints'.

I do love the idea of <[a..f]> type ranges though. It's just what the
three dots mean that's got me confused.

Patrick R. Michaud

Apr 15, 2005, 10:36:18 AM
to Rafael Garcia-Suarez, perl6-l...@perl.org
On Fri, Apr 15, 2005 at 01:01:58PM -0000, Rafael Garcia-Suarez wrote:
> Aaron Sherman wrote in perl.perl6.language :

> >
> > A silly question: is there a canonical character set from which we
> > extract these ranges? Are we hard-coding Unicode here, or is there some
> > way for the user to specify the character set for ranges?
>
> Perl 5 forces [a-z] (or [i-j] for that matter) to be a range of
> lowercase alphabetic characters, even on EBCDIC platforms (where it's
> not).

At the moment, PGE (the part that implements the rule engine) is
deferring such questions to Parrot, and otherwise assuming Unicode.
Plus, S02 explicitly indicates that Perl is written in Unicode
and has consistent Unicode semantics, so I think that's what we should
go with. It's certainly the way the compiler will go, at least
initially.
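
To make the EBCDIC point quoted above concrete, here is a small Python sketch comparing how far apart two letters sit as Unicode code points and as bytes in an EBCDIC code page. The helper name span is made up for this example, and cp037 is just one EBCDIC code page that ships with Python's standard codecs; it says nothing about what Parrot or PGE do internally.

```python
def span(lo, hi, codec=None):
    """Number of positions from lo to hi, inclusive.

    With codec=None the characters are compared as Unicode code points;
    otherwise as single bytes in the named codec (hypothetical helper).
    """
    if codec is None:
        a, b = ord(lo), ord(hi)
    else:
        a, b = lo.encode(codec)[0], hi.encode(codec)[0]
    return b - a + 1


print(span("i", "j"))            # adjacent as Unicode code points
print(span("i", "j", "cp037"))   # not adjacent as EBCDIC bytes
print(span("a", "z"))            # contiguous block of letters
print(span("a", "z", "cp037"))   # wider than 26: the block has gaps
```

With Unicode semantics a raw code point range from a to z is exactly the lowercase ASCII letters, so no special-casing of letter ranges is needed.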

Pm

Steven Philip Schubiger

Apr 15, 2005, 6:33:28 AM
to perl6-l...@perl.org
On 14 Apr, Larry Wall wrote:

: In writing some character class translation, I realized that

I think, if we bear in mind, as it has been stressed previously, that
many changes concerning regular expressions have been introduced and
require users to adapt accordingly, it doesn't seem unreasonable to
require a double-dot instead of a hyphen; it also fits the
"Principle of least surprise" idiom nicely, in my opinion.

Nevertheless, as mentioned by David, <[a...]> would become rather
confusing to people first and secondly to the compiler; although,
regardless of whether we assume dot precedes double-dot or vice-versa,
there would be an expansion enforced (what I'd expect), perhaps
accompanied by a warning.

I agree on a warning upon non-escaped hyphen.

Steven

Rafael Garcia-Suarez

Apr 15, 2005, 9:01:58 AM
to perl6-l...@perl.org
Aaron Sherman wrote in perl.perl6.language :
>
> A silly question: is there a canonical character set from which we
> extract these ranges? Are we hard-coding Unicode here, or is there some
> way for the user to specify the character set for ranges?

Perl 5 forces [a-z] (or [i-j] for that matter) to be a range of
lowercase alphabetic characters, even on EBCDIC platforms (where it's
not).

Rod Adams

Apr 15, 2005, 12:28:31 PM
to perl6-l...@perl.org
David Wheeler wrote:

> But the first person to write <[a...]> gets what's comin' to 'em.

Is that nothing (since '.' lt 'a'), or everything after 'a'?

-- Rod Adams

Larry Wall

Apr 15, 2005, 2:59:22 PM
to perl6-l...@perl.org

Might as well make it everything after 'a' for consistency. One could
also view the last dot as a special version of the ordinary "any" dot,
and read it "a to whatever".
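
A rough Python sketch of that reading, treating a trailing ... as "from here to the last code point" and \. as a literal dot. This is only an illustration: parse_class, in_class and MAX_CODEPOINT are names made up here, escapes such as \x20 are not handled, and nothing below reflects how PGE actually parses character classes.

```python
MAX_CODEPOINT = 0x10FFFF  # highest Unicode code point


def parse_class(body):
    """Turn a <[...]> body into a list of inclusive (lo, hi) code point pairs."""
    # Pass 1: tokenise, so an escaped dot is never read as a range operator.
    tokens = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            tokens.append(("char", body[i + 1]))  # '\.' -> literal '.'
            i += 2
        elif body.startswith("...", i) and i + 3 == len(body):
            tokens.append(("open", None))  # trailing 'x...': x to whatever
            i += 3
        elif body.startswith("..", i):
            tokens.append(("range", None))
            i += 2
        else:
            tokens.append(("char", body[i]))
            i += 1

    # Pass 2: combine tokens into ranges.
    ranges = []
    j = 0
    while j < len(tokens):
        kind, ch = tokens[j]
        if kind != "char":
            raise ValueError("range operator without a start character")
        following = [t[0] for t in tokens[j + 1:j + 3]]
        if following[:1] == ["open"]:
            ranges.append((ord(ch), MAX_CODEPOINT))
            j += 2
        elif following == ["range", "char"]:
            ranges.append((ord(ch), ord(tokens[j + 2][1])))
            j += 3
        else:
            ranges.append((ord(ch), ord(ch)))
            j += 1
    return ranges


def in_class(ch, ranges):
    """True if ch falls inside any of the parsed ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)
```

Here parse_class("a...") gives a single range from 'a' to the last code point, while parse_class(r"a..\.") gives the range from 'a' to '.', which is empty because '.' sorts before 'a'. The escaped form also covers the \x21 .. \x2e case: parse_class(r"!..\.").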

Larry

Joe Gottman

Apr 17, 2005, 1:35:33 PM
to perl6-l...@perl.org

> -----Original Message-----
> From: Paul Hodges [mailto:ydb...@yahoo.com]
> Sent: Sunday, April 17, 2005 1:30 PM
> To: Larry Wall; perl6-l...@perl.org
> Subject: Re: should we change [^a-z] to <-[a..z]> instead of <-[a-z]>?
>
>

> --- Larry Wall <la...@wall.org> wrote:
> . . .


> > <-[a..z]>
> >
> > should be allowed/encouraged/required. It greatly improves the
> > readability in my estimation. The only problem with requiring .. is
> > that people *will* write <[a-z]> out of habit, and we would probably
> > have to outlaw the - form for many years before everyone would get
> > used to the .. form. So maybe we allow - but warn if not
> > backslashed.
>

> In general, I think this is a great idea, but what exactly do you mean
> by "warn if not backslashed"? That I'd get a warning *any* time I use a
> dash in a character class? I guess I can live with that.

On the other hand, you can use the canonical perl 5 trick of having the
dash be the first character in the class if you want to use a literal dash.

Joe Gottman.

Paul Hodges

Apr 17, 2005, 1:25:18 PM
to Larry Wall, perl6-l...@perl.org

--- Larry Wall <la...@wall.org> wrote:

I think that if we're looking for consistency, the default should be to
read it as "a and everything after it". If someone wants "a to
whatever", they should write it <[a..\.]> since it's a pretty odd
fringe case.



Paul Hodges

Apr 17, 2005, 1:29:35 PM
to Larry Wall, perl6-l...@perl.org

--- Larry Wall <la...@wall.org> wrote:
. . .
> <-[a..z]>
>
> should be allowed/encouraged/required. It greatly improves the
> readability in my estimation. The only problem with requiring .. is
> that people *will* write <[a-z]> out of habit, and we would probably
> have to outlaw the - form for many years before everyone would get
> used to the .. form. So maybe we allow - but warn if not
> backslashed.

In general, I think this is a great idea, but what exactly do you mean
by "warn if not backslashed"? That I'd get a warning *any* time I use a
dash in a character class? I guess I can live with that.
