
\x{123a 123b 123c}


Ruud H.G. van Tol

Nov 19, 2005, 7:19:47 PM
to perl6-l...@perl.org
Maybe

"\x{123a 123b 123c}"

is a nice alternative of

"\x{123a} \x{123b} \x{123c}".

--
Grtz, Ruud

Juerd

Nov 19, 2005, 7:26:21 PM
to perl6-l...@perl.org
Ruud H.G. van Tol skribis 2005-11-20 1:19 (+0100):

> Maybe
> "\x{123a 123b 123c}"
> is a nice alternative of
> "\x{123a} \x{123b} \x{123c}".

Hmm, very cute and friendly! Can we keep it, please? Please?


Juerd
--
http://convolution.nl/maak_juerd_blij.html
http://convolution.nl/make_juerd_happy.html
http://convolution.nl/gajigu_juerd_n.html

Larry Wall

Nov 19, 2005, 9:32:17 PM
to perl6-l...@perl.org
On Sun, Nov 20, 2005 at 01:26:21AM +0100, Juerd wrote:
: Ruud H.G. van Tol skribis 2005-11-20 1:19 (+0100):

: > Maybe
: > "\x{123a 123b 123c}"
: > is a nice alternative of
: > "\x{123a} \x{123b} \x{123c}".
:
: Hmm, very cute and friendly! Can we keep it, please? Please?

We already have, from A5, \x[0a;0d], so you can supposedly say

"\x[123a;123b;123c]"

Note that square brackets are now the normative style though, since we're
trying to reserve curlies psychologically for closures.

But I see that the semicolon is rather cluttery, mainly because it's
too tall. I'm not sure going all the way to space is good, but we
might have

"\x[123a,123b,123c]"

just to get a little visual space along with the separator. My problem
with space is that it has potential visual confusion with character
classes (especially with the square brackets), and it also will make
people wonder whether :w should match optional whitespace between
the characters. The commas seem to imply sequence to me, and they
occur often enough that you can see it's not a well-formed character
class, insofar as it has repeated characters.
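
As a rough illustration of the interpolation being discussed, here is a sketch in Python (not Perl 6). The helper name expand_hex_escapes is hypothetical, and accepting ';', ',' and whitespace alike as separators is an assumption made only to compare the candidate spellings, not a settled rule:

```python
import re


def expand_hex_escapes(text: str) -> str:
    """Expand \\x[...] groups into the characters they name.

    Assumption: ';', ',' and whitespace are all accepted as separators.
    """
    def repl(match: re.Match) -> str:
        # Pull out each run of hex digits and turn it into one character.
        codes = re.findall(r"[0-9A-Fa-f]+", match.group(1))
        return "".join(chr(int(code, 16)) for code in codes)

    return re.sub(r"\\x\[([0-9A-Fa-f;,\s]+)\]", repl, text)


same = (
    expand_hex_escapes(r"\x[123a;123b;123c]")
    == expand_hex_escapes(r"\x[123a,123b,123c]")
    == expand_hex_escapes(r"\x[123a 123b 123c]")
)
print(same)
```

All three spellings expand to the same three-character string here, so the choice between them is purely about readability.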

It occurs to me that we didn't spec whether character classes ignore
whitespace. They probably should, just so you can chunk things:

/ <[ a..z A..Z 0..9 _ ]> /

Then the question arises about whether <[ \ ]> is an escaped space
or a backslash, or illegal. But if we make it match a backslash
or illegal, then the minimal space matcher becomes \x20, I think,
unless you graduate to \s. On the other hand, if we make it match
a space, people aren't going to read that way unless they're pretty
sophisticated...
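
One possible reading can be sketched in Python, assuming whitespace inside the class is ignored and backslash-space is an escaped space. The helper class_to_regex is hypothetical and handles only literal characters, '..' ranges and backslash escapes:

```python
import re


def class_to_regex(body: str) -> str:
    """Translate a class body like 'a..z A..Z 0..9 _' into a Python class.

    Assumptions: unescaped whitespace is ignored, '..' denotes a range,
    and a backslash makes the next character literal (so '\\ ' is a space).
    """
    parts = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            parts.append(re.escape(body[i + 1]))  # escaped char is literal
            i += 2
        elif ch.isspace():
            i += 1  # insignificant whitespace, used only for chunking
        elif body.startswith("..", i):
            parts.append("-")  # range operator
            i += 2
        else:
            parts.append(re.escape(ch))
            i += 1
    return "[" + "".join(parts) + "]"


word_char = re.compile(class_to_regex("a..z A..Z 0..9 _"))
with_space = re.compile(class_to_regex(r"a..z \ "))
print(bool(word_char.fullmatch(" ")), bool(with_space.fullmatch(" ")))
```

Under this reading the chunking spaces never match anything, and only the explicitly escaped space does.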

Larry

Patrick R. Michaud

Nov 20, 2005, 11:27:17 AM
to perl6-l...@perl.org
On Sat, Nov 19, 2005 at 06:32:17PM -0800, Larry Wall wrote:
> On Sun, Nov 20, 2005 at 01:26:21AM +0100, Juerd wrote:
> : Ruud H.G. van Tol skribis 2005-11-20 1:19 (+0100):
> : > Maybe
> : > "\x{123a 123b 123c}"
> : > is a nice alternative of
> : > "\x{123a} \x{123b} \x{123c}".
>
> We already have, from A5, \x[0a;0d], so you can supposedly say
> "\x[123a;123b;123c]"

Hmm, I hadn't caught that particular syntax in A05. AFAIK it's not
in S05, so I should probably add it, or whatever syntax we end up
adopting.

(BTW, we haven't announced it on p6l yet, but there's a new version of
S05 available.)

> [...]


> But I see that the semicolon is rather cluttery, mainly because it's
> too tall. I'm not sure going all the way to space is good, but we
> might have
> "\x[123a,123b,123c]"
> just to get a little visual space along with the separator.

Just to verify, with this syntax would we expect

\x[123a,123b,123c]+

to be the same as

[\x123a \x123b \x123c]+

and not "\x123a \x123b \x123c+" ?

> It occurs to me that we didn't spec whether character classes ignore
> whitespace. They probably should, just so you can chunk things:
>
> / <[ a..z A..Z 0..9 _ ]> /
>
> Then the question arises about whether <[ \ ]> is an escaped space
> or a backslash, or illegal

I vote that it's an escaped space. A backslash is nearly always \\
(or should be imho).

> But if we make it match a backslash
> or illegal, then the minimal space matcher becomes \x20, I think,
> unless you graduate to \s. On the other hand, if we make it match
> a space, people aren't going to read that way unless they're pretty
> sophisticated...

There's also <sp>, unless someone redefines the <sp> subrule.
And in the general case that's a slightly more expensive mechanism
to get a space (it involves at least a subrule lookup). Perhaps
we could also create a visible meta sequence for it, in the same
way that we have visible metas for \e, \f, \r, \t. But I have
no idea what letter we might use there.

I don't think I like this, but perhaps C<< <> >> becomes <?null>
and C<< < > >> becomes <' '>? Seems like not enough visual distinction
there...

Pm

TSa

Nov 21, 2005, 9:23:35 AM
to perl6-l...@perl.org
HaloO,

Patrick R. Michaud wrote:
> There's also <sp>, unless someone redefines the <sp> subrule.
> And in the general case that's a slightly more expensive mechanism
> to get a space (it involves at least a subrule lookup). Perhaps
> we could also create a visible meta sequence for it, in the same
> way that we have visible metas for \e, \f, \r, \t. But I have
> no idea what letter we might use there.

How about \x and \X respectively? Note the *space* after it :)
I mean that much more seriously than it might sound, er, read.
I hope the concept of unwritten things in the source being
interesting values of void/undef applies always.

OTOH, I'm usually not saying anything in the area of the grammar
subsystem, but I still try to wrap my brain around the underlying
unified conceptual level where rules and methods or subs and macros
are indistinguishable. So, please consider this as a well-meaning
question. And please forgive the syntax errors.

With something like

# or token? perhaps even sub?
macro x ( HexLiteral *[$char = 32, *@more] )
is parsed( <HexLiteral>* )
{...}

and \ in match strings escaping out to the macro level when
the circumfix match creator is invoked, I would expect

m/ \x /; # single space is required
m/ \x20 /; # same
m/ <{x}> /; # same?
m/ \X /; # any single char except space
m/ \x\x\x /; # exactly three spaces
m/ \x[20,20,20] /; # same, as proposed by Larry
m/ \xy /; # parse error 'y not a hex digit'
m/ \x y /; # one space then y

to insert verbatim, machine level chars into the match definition.
In particular *no* lookup is compiled in.

I would call \x the single character *exact* matcher and \X
the *excluder*. BTW, the definition of the latter could just be

&X ::= !&x; # or automagically defined by up-casing and outer negation

if ? and ! play in the meta operator league.


> I don't think I like this, but perhaps C<< <> >> becomes <?null>
> and C<< < > >> becomes <' '>? Seems like not enough visual distinction
> there...

I strongly agree. I would ask the moot question *how* the single space
in / / is removed ---as leading, trailing or separating space---when the
parser goes over it. But I would never expect the source space to make it
into the compiled match code!
--

Patrick R. Michaud

Nov 21, 2005, 9:32:51 AM
to TSa, perl6-l...@perl.org
On Mon, Nov 21, 2005 at 03:23:35PM +0100, TSa wrote:
> Patrick R. Michaud wrote:
> >There's also <sp>, unless someone redefines the <sp> subrule.
> >And in the general case that's a slightly more expensive mechanism
> >to get a space (it involves at least a subrule lookup). Perhaps
> >we could also create a visible meta sequence for it, in the same
> >way that we have visible metas for \e, \f, \r, \t. But I have
> >no idea what letter we might use there.
>
> How about \x and \X respectively? Note the *space* after it :)
> ...

If we're going to do that, I'd think it would be "\c " and "\C "
instead of "\x " and "\X ". I'm not really advocating this,
I'm just commenting that in this case \c seems more natural
than \x.

Pm

Ruud H.G. van Tol

Nov 21, 2005, 11:49:59 AM
to perl6-l...@perl.org
Larry Wall:
> Juerd:
>> Ruud:

>>> Maybe
>>> "\x{123a 123b 123c}"
>>> is a nice alternative of
>>> "\x{123a} \x{123b} \x{123c}".
>>
>> Hmm, very cute and friendly! Can we keep it, please? Please?

Thanks for the support.


> We already have, from A5, \x[0a;0d], so you can supposedly say
> "\x[123a;123b;123c]"

<rereading apo5 />
Found it in the old/new table on page 7. For me the semicolon is fine.

I am using character names more and more, and between those, semicolons
are less cluttery. Character names can contain spaces, but semicolons
too? If not then
\c[BEL; EXTENDED ARABIC-INDIC DIGIT ZERO] would be possible, but maybe
better not, or more like
\c['BEL'; 'EXTENDED ARABIC-INDIC DIGIT ZERO'] or even
\c('BEL', 'EXTENDED ARABIC-INDIC DIGIT ZERO').

Something else:
The '^' could be used for both the ultimate start- and end-of-string.
This frees the '$'.

There is still the '$$' that matches before embedded newlines, and since
'^^' matches after those newlines, the '^^' and '$$' can only be unified
to '^^' if it is one-width inside a string, so is like '[$$\n^^]' (or
just '\n') there.
At start- and end-of-string the '^^' can still be a zero-width match.
I am not sure about greedy (meaning to try one-width first) or
non-greedy.

Example: '^[(\N*)^^]*^' to capture all lines, clean of newlines.
Not a lot clearer than '^[(\N*)\n*]*$', but freeing the '$' and '$$'
might be worth it.

<mess about '^^+', '^+^' and '^*^' (bats!) removed>

--
Affijn, Ruud

"Gewoon is een tijger."

Larry Wall

Nov 21, 2005, 12:02:57 PM
to perl6-l...@perl.org
On Sun, Nov 20, 2005 at 10:27:17AM -0600, Patrick R. Michaud wrote:

: On Sat, Nov 19, 2005 at 06:32:17PM -0800, Larry Wall wrote:
: > On Sun, Nov 20, 2005 at 01:26:21AM +0100, Juerd wrote:
: > : Ruud H.G. van Tol skribis 2005-11-20 1:19 (+0100):
: > : > Maybe
: > : > "\x{123a 123b 123c}"
: > : > is a nice alternative of
: > : > "\x{123a} \x{123b} \x{123c}".
: >
: > We already have, from A5, \x[0a;0d], so you can supposedly say
: > "\x[123a;123b;123c]"
:
: Hmm, I hadn't caught that particular syntax in A05. AFAIK it's not
: in S05, so I should probably add it, or whatever syntax we end up
: adopting.

Yes.

: (BTW, we haven't announced it on p6l yet, but there's a new version of
: S05 available.)

Indeed, there are new versions of most of the S's. People who want the
latest should use svn.perl.org, which also makes it easy to do diff listings
with svn or svk.

: > [...]


: > But I see that the semicolon is rather cluttery, mainly because it's
: > too tall. I'm not sure going all the way to space is good, but we
: > might have
: > "\x[123a,123b,123c]"
: > just to get a little visual space along with the separator.
:
: Just to verify, with this syntax would we expect
:
: \x[123a,123b,123c]+
:
: to be the same as
:
: [\x123a \x123b \x123c]+
:
: and not "\x123a \x123b \x123c+" ?

Yes. I think the rule interpretation of \x is that it is a sequence to
be considered a single character regardless of its context. Certainly
the square brackets we've mandated would tend to read as grouping anyway.

Of course, the main point of the \x[a,b,c] notation is to allow
interpolation of sequences of hex characters into ordinary strings,
and those don't care about abstract character boundaries.
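
The grouping question can be sketched with Python's re module as a stand-in (Perl 6 rules are not being run here). The non-capturing group plays the role of the bracketed sequence:

```python
import re

seq = "\u123a\u123b\u123c"

# Quantifier applies to the whole three-character sequence.
grouped = re.compile(r"(?:\u123a\u123b\u123c)+")
# Quantifier applies only to the last character.
last_only = re.compile(r"\u123a\u123b\u123c+")

print(bool(grouped.fullmatch(seq * 2)))    # sequence repeated twice
print(bool(last_only.fullmatch(seq * 2)))  # not a run of the last char
```

The first reading accepts the whole sequence repeated; the second only accepts extra copies of the final character.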

: > It occurs to me that we didn't spec whether character classes ignore
: > whitespace. They probably should, just so you can chunk things:
: >
: > / <[ a..z A..Z 0..9 _ ]> /
: >
: > Then the question arises about whether <[ \ ]> is an escaped space
: > or a backslash, or illegal
:
: I vote that it's an escaped space. A backslash is nearly always \\
: (or should be imho).
:
: > But if we make it match a backslash
: > or illegal, then the minimal space matcher becomes \x20, I think,
: > unless you graduate to \s. On the other hand, if we make it match
: > a space, people aren't going to read that way unless they're pretty
: > sophisticated...
:
: There's also <sp>, unless someone redefines the <sp> subrule.

But you can't use <sp> in a character class. Well, that is, unless
you write it:

<+[ a..z ]+<sp>>

or some such. Maybe that's good enough.

: And in the general case that's a slightly more expensive mechanism
: to get a space (it involves at least a subrule lookup). Perhaps
: we could also create a visible meta sequence for it, in the same
: way that we have visible metas for \e, \f, \r, \t. But I have
: no idea what letter we might use there.

Something to be said for \_ in that regard.

: I don't think I like this, but perhaps C<< <> >> becomes <?null>
: and C<< < > >> becomes <' '>? Seems like not enough visual distinction
: there...

<_> maybe. I'm good with <> being <?null>, and <,> being element boundary
when matching lists. But I'd like to reserve < > for delimiting what
is returned by $<>, the string officially matched:

"foo bar baz" ~~ /:w foo < \w+ > baz/
say $/; # foo bar baz
say $<>; # bar

Or possibly

"foo bar baz" ~~ /:w foo << \w+ >> baz/

but that should probably mean whatever

"foo bar baz" ~~ /:w foo « \w+ » baz/

eventually means. Which I haven't the foggiest. But we should probably
reserve the brackets on general principle's sake, just because brackets
are so scarce.
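
The distinction between the whole match and the "officially matched" part can be sketched in Python, purely as an analogy. A capture group and a pair of lookarounds are two existing ways to get a similar effect; neither is the proposed rule syntax:

```python
import re

text = "foo bar baz"

# Capture group: group(0) is everything matched, group(1) the inner part.
m = re.search(r"foo\s+(\w+)\s+baz", text)
whole, inner = m.group(0), m.group(1)

# Lookarounds: the context must be present but is not part of the match.
m2 = re.search(r"(?<=foo\s)\w+(?=\sbaz)", text)
official = m2.group(0)

print(whole, "|", inner, "|", official)
```

In the lookaround form the surrounding words are required but excluded from the result, which is roughly what the delimiters are meant to express.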

I dunno. If «...» in ordinary code does shell quoting, maybe «...» in
rules does filename globbing or some such. I can see some issues with
anchoring semantics. Makes more sense on a string as a whole, but maybe
can anchor on element boundaries if used on a list of filenames.
I suppose one could even go as far as

rule jpeg :i « *.jp{e,}g »

or whatever the right glob syntax is.

Larry

Larry Wall

Nov 21, 2005, 12:33:07 PM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 09:02:57AM -0800, Larry Wall wrote:
: But I'd like to reserve < > for delimiting what is returned by $<>,
: the string officially matched:
:
: "foo bar baz" ~~ /:w foo < \w+ > baz/
: say $/; # foo bar baz
: say $<>; # bar

Though it occurs to me that there's another possible interpretation,
culturally speaking. The overloading of \b has always bothered me,
plus the fact that \b can't distinguish which kind of word boundary
without additional context. In regex culture, we have the \<...\>
word matcher, and maybe that devolves to isolated < ... > in rules.
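
The limitation of \b can be seen in a small Python sketch. The two lookaround patterns are one conventional way to spell start-of-word and end-of-word separately; they are an illustration, not the syntax under discussion:

```python
import re

text = "foo bar"

# \b fires at both kinds of boundary and cannot tell them apart.
both = [m.start() for m in re.finditer(r"\b", text)]
# Start of word: no word char behind, a word char ahead.
starts = [m.start() for m in re.finditer(r"(?<!\w)(?=\w)", text)]
# End of word: a word char behind, no word char ahead.
ends = [m.start() for m in re.finditer(r"(?<=\w)(?!\w)", text)]

print(both, starts, ends)
```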

We could still use << ... >> to capture $<>, which I was leaning toward
anyway just for visibility reasons, since the two ends could be quite
far apart.

And file globbing could just be :glob or some such if we really need
to embed it in rules.

Larry

Larry Wall

Nov 21, 2005, 12:28:03 PM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 05:49:59PM +0100, Ruud H.G. van Tol wrote:
: Larry Wall:

: > Juerd:
: >> Ruud:
:
: >>> Maybe
: >>> "\x{123a 123b 123c}"
: >>> is a nice alternative of
: >>> "\x{123a} \x{123b} \x{123c}".
: >>
: >> Hmm, very cute and friendly! Can we keep it, please? Please?
:
: Thanks for the support.

Hey, this ain't exactly a popularity contest here... :-)

: > We already have, from A5, \x[0a;0d], so you can supposedly say
: > "\x[123a;123b;123c]"
:
: <rereading apo5 />
: Found it in the old/new table on page 7. For me the semicolon is fine.

The fact that you say "page 7" leads me to guess that you're reading
it from perl.com. That's going to be the most out-of-date version.
Better would be

dev.perl.org one day latency but html-ified
svn.perl.org up to the minute but only in pod

In particular, the Apocalypses have little [Update:] sections that are
supposed to alert you to things that have changed since the Apo
was written. (Though some of those are a little out of date right now
too--I'm just working my way through A12 again.)

: I am using character names more and more, and between those, semicolons
: are less cluttery. Character names can contain spaces, but semicolons
: too? If not then
: \c[BEL; EXTENDED ARABIC-INDIC DIGIT ZERO] would be possible, but maybe
: better not, or more like
: \c['BEL'; 'EXTENDED ARABIC-INDIC DIGIT ZERO'] or even
: \c('BEL', 'EXTENDED ARABIC-INDIC DIGIT ZERO').

None of the current names contain either semicolon or comma, so I expect
they're avoiding them by policy.
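
If names never contain the separator, an unquoted list can be split safely. A minimal Python sketch, assuming ';' as the separator and using the standard unicodedata module for the lookup (the helper chars_from_names is hypothetical):

```python
import unicodedata


def chars_from_names(spec: str) -> str:
    """Turn 'NAME; NAME; ...' into the string of the named characters.

    Assumption: ';' never occurs inside a character name, so a plain
    split is enough and no quoting is needed.
    """
    names = [name.strip() for name in spec.split(";")]
    return "".join(unicodedata.lookup(name) for name in names if name)


result = chars_from_names(
    "LATIN SMALL LETTER A; EXTENDED ARABIC-INDIC DIGIT ZERO"
)
print(len(result))
```

Spaces inside a name survive because only the separator and the surrounding padding are stripped.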

: Something else:
: The '^' could be used for both the ultimate start- and end-of-string.
: This frees the '$'.

I think this is one of those aspects of regex culture that is too
entrenched to remove. Besides, you have to be able to distinguish
s/^/foo/ from s/$/foo/.

: There is still the '$$' that matches before embedded newlines, and since
: '^^' matches after those newlines, the '^^' and '$$' can only be unified
: to '^^' if it is one-width inside a string, so is like '[$$\n^^]' (or
: just '\n') there.

But then if you use it within a capture, you get an extra newline you
probably don't want.

: At start- and end-of-string the '^^' can still be a zero-width match.
: I am not sure about greedy (meaning to try one-width first) or
: non-greedy.
:
: Example: '^[(\N*)^^]*^' to capture all lines, clean of newlines.
: Not a lot clearer than '^[(\N*)\n*]*$', but freeing the '$' and '$$'
: might be worth it.

I don't think it's any clearer. In fact, I find all the ^'s there
are a little too visually confusing and contextual.

Larry

Ruud H.G. van Tol

Nov 21, 2005, 1:57:59 PM
to perl6-l...@perl.org
Larry Wall:
> Ruud H.G. van Tol:


> dev.perl.org one day latency but html-ified
> svn.perl.org up to the minute but only in pod

Thanks, much better. Can't say that I haven't been there before.

There is a "[[:alpha:][:digit:]" and a "[[:alpha:][:digit]]" on the
A5-page.


>> The '^' could be used for both the ultimate start- and end-of-string.
>> This frees the '$'.
>
> I think this is one of those aspects of regex culture that is too
> entrenched to remove.

Yes, I have experienced that with some of my procmail-recipes that use
'^' to match embedded newlines.
In procmail the '^^' matches begin- or end-of-string. Both a '^' and a
'$' can be used to match a real or putative newline. Some people
replaced my '^'s with '$'s.

OK, everybody can stop reading here, no serious attempts below.

"Within C++, there is a much smaller and cleaner language struggling to
get out," which "would ... have been an unimportant cult language."
(Bjarne Stroustrup, The Design and Evolution of C++).


> Besides, you have to be able to distinguish
> s/^/foo/ from s/$/foo/.

's/$/foo/' becomes 's/<after .*>/foo/'
<g>


>> There is still the '$$' that matches before embedded newlines, and
>> since '^^' matches after those newlines, the '^^' and '$$' can only
>> be unified to '^^' if it is one-width inside a string, so is like
>> '[$$\n^^]' (or just '\n') there.
>
> But then if you use it within a capture, you get an extra newline you
> probably don't want.

Place the ^^ outside the ().

I wasn't sure about the default for the greediness of '^^' at begin- or
end-of-string, I guess non-greediness can be arranged with a trailing
'?'.


>> At start- and end-of-string the '^^' can still be a zero-width match.
>> I am not sure about greedy (meaning to try one-width first) or
>> non-greedy.
>>
>> Example: '^[(\N*)^^]*^' to capture all lines, clean of newlines.
>> Not a lot clearer than '^[(\N*)\n*]*$', but freeing the '$' and '$$'
>> might be worth it.
>
> I don't think it's any clearer.

Pardon my Dutch, I didn't find it clearer either ("but, might be worth
it").


> In fact, I find all the ^'s there
> are a little too visually confusing and contextual.

/^ # BoS
[ # start of non-capturing group
(\N*) # capture a substring of non-newlines
^^ # newline or EoS
]* # end of non-capturing group, repeat
^/x # EoS

As I just said, I am used to '^^' as start- and end-of-buffer, and '^'
as matching a real or putative newline, because of procmail.
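
For comparison, the goal of that pattern (every line captured, clean of newlines) can be sketched in Python with the conventional anchors. This shows the intent only, not the proposed '^^' semantics:

```python
import re

text = "one\ntwo\n\nthree"

# In MULTILINE mode '^' matches after each newline and '$' before each one,
# so each match is one line without its newline (empty lines included).
lines = re.findall(r"^[^\n]*$", text, flags=re.MULTILINE)

print(lines == text.split("\n"))
```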

--
Grtz, Ruud

Larry Wall

Nov 21, 2005, 3:08:08 PM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 07:57:59PM +0100, Ruud H.G. van Tol wrote:
: There is a "[[:alpha:][:digit:]" and a "[[:alpha:][:digit]]" on the
: A5-page.

Hmm, well, thanks--I went to fix it and I see Patrick beat me to
the fix. But in one of the updates, it says:

+[Update: Actually, that's now written C<< <+alpha+digit> >>, avoiding
+the mistaken impression entirely.]

And it occurs to me that we could probably allow <alpha+digit> there
since there's no ambiguity what <alpha means, and we're already claiming
the next character after the opening word to decide how to process the
rest of the text inside angles. Even if someone writes

<alpha + digit>

that would fail under the current policy of treating "+ digit" as rule,
since you can't start a rule with +.

Unfortunately, though,

<identchar - digit>

would be ambiguous, and/or wrong. Could allow whitespace there if we
picked an explicit "this is rule" character. Did we remove "this is
string"? If so, we could swipe the colon:

<after: --help>

Could put back "this is string" with explicit quotes:

<after '--help'>

but that doesn't save much over

<after('--help')>

which is partly why we removed "this is string" in the first place.
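
Class union and difference can be approximated in Python's re, which has no set operators of its own. The two spellings below are rough analogues chosen for this sketch (alpha plus digit as "\w without underscore", identchar minus digit as "\w that is not \d"), not translations of the rule syntax:

```python
import re

# Roughly <+alpha+digit>: any word character except the underscore.
alpha_or_digit = re.compile(r"[^\W_]")
# Roughly <identchar - digit>: a word character that is not a digit.
ident_minus_digit = re.compile(r"(?!\d)\w")

print(
    bool(alpha_or_digit.fullmatch("5")),
    bool(ident_minus_digit.fullmatch("5")),
    bool(ident_minus_digit.fullmatch("_")),
)
```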

Larry

Ruud H.G. van Tol

Nov 21, 2005, 4:32:53 PM
to perl6-l...@perl.org
Larry Wall:

> in one of the updates, it says:
>
> +[Update: Actually, that's now written C<< <+alpha+digit> >>,
> avoiding +the mistaken impression entirely.]

In dev's A05.html I only found:
"[Update: That must now be written <+<alpha>+<digit>>, or it will be
mistaken for «alpha><digit», which doesn't work too well.]".

I see those character classes as infinite sort-of-binary masks, so
<alpha|digit> looks right to me.
Idem <[_] | alpha | digit & !Swedish>, with left-to-right application.
(I can't foresee all the consequences.)

--
Grtz, Ruud (sober.u is on the loose)

Ruud H.G. van Tol

Nov 21, 2005, 5:19:48 PM
to perl6-l...@perl.org
Patrick R. Michaud:

>> 's/$/foo/' becomes 's/<after .*>/foo/'
>> <g>
>

> Uh, no, because <after> is still a zero width assertion. :-)


That's why I chose it. It is not at the end-of-string?

perl5 -e '$_="abc"; s/(?<=...)/x/; print'

perl5 -e '$_="abc"; s/(?!.)/x/; print'

's/<!before .>/foo/'

--
Grtz, Ruud

Juerd

Nov 21, 2005, 5:43:31 PM
to perl6-l...@perl.org
Larry Wall skribis 2005-11-21 12:08 (-0800):

> Unfortunately, though,
> <identchar - digit>
> would be ambiguous, and/or wrong.

Well, we could of course change "-" to mean "-1 or fewer", as "+" means
"+1 or more"... :D

Ruud H.G. van Tol

Nov 21, 2005, 7:09:40 PM
to perl6-l...@perl.org
Patrick R. Michaud:
> Ruud H.G. van Tol:
>> Patrick R. Michaud:
>>> Ruud H.G. van Tol:

>>>> 's/$/foo/' becomes 's/<after .*>/foo/'
>>>

>>> Uh, no, because <after> is still a zero width assertion. :-)
>>
>> That's why I chose it. It is not at the end-of-string?
>

> Because ".*" matches "", /<after .*>/ would be true at
> every position in the string, including the beginning,
> and this is where "foo" would be substituted.

I expected greediness, also because <after .*?> could behave non-greedy.

Just like:
s/(.*)/$1foo/
s/(.*?)/$1foo/

OK, so 's/<!before .>/foo/' it must be.

But why does <after .*> behave non-greedy?

--
Grtz, Ruud

Ruud H.G. van Tol

Nov 21, 2005, 11:07:07 PM
to perl6-l...@perl.org
Patrick R. Michaud:
> Ruud H.G. van Tol:

>>>>>> 's/$/foo/' becomes 's/<after .*>/foo/'
>>>>>
>>>>> Uh, no, because <after> is still a zero width assertion. :-)
>>>>
>>>> That's why I chose it. It is not at the end-of-string?
>>>
>>> Because ".*" matches "", /<after .*>/ would be true at
>>> every position in the string, including the beginning,
>>> and this is where "foo" would be substituted.
>>
>> I expected greediness, also because <after .*?> could behave
>> non-greedy. ...


>> But why does <after .*> behave non-greedy?
>

> I think you may be misreading what <after .*> does -- it's a
> lookbehind assertion.

No, I was no longer misreading it, I was questioning its rationale. I
wondered what would be lost if the construct would behave more like
's/(.*)/$1foo/'. Sorry for not making that more explicit. I was still
getting rid of the '$'. And monitoring the outbreak of sober.u.


> The greediness of the .* subpattern in <after .*> doesn't affect
> things at all -- <after .*> is still a zero-width assertion.

There is a zero-width 'slot' before (and after) each character in the
pattern string. As a zero-width assertion, '<after .*>' has no sense, no
'self', since it can't move the match position to another slot.

In '<after ab*>', the 'b*' means nothing.
In '<after ab+>', the '+' means nothing.
In '<after .*a>', the '.*' means nothing.

Unless the meaning of '<after .*a>' would be changed to: try the last
'a' first.

--
Grtz, Ruud

Patrick R. Michaud

Nov 21, 2005, 3:27:16 PM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 12:08:08PM -0800, Larry Wall wrote:
> On Mon, Nov 21, 2005 at 07:57:59PM +0100, Ruud H.G. van Tol wrote:
> : There is a "[[:alpha:][:digit:]" and a "[[:alpha:][:digit]]" on the
> : A5-page.
>
> Hmm, well, thanks--I went to fix it and I see Patrick beat me to
> the fix. But in one of the updates, it says:
>
> +[Update: Actually, that's now written C<< <+alpha+digit> >>, avoiding
> +the mistaken impression entirely.]

I went ahead and added the update while fixing the typos. :-)

> And it occurs to me that we could probably allow <alpha+digit> there
> since there's no ambiguity what <alpha means, and we're already claiming
> the next character after the opening word to decide how to process the
> rest of the text inside angles. Even if someone writes
>
> <alpha + digit>
>
> that would fail under the current policy of treating "+ digit" as rule,
> since you can't start a rule with +.

Somehow I prefer the explicit leading + or -, so that we *know* this
is a rule composition of some sort. It also fits in well with the
convention that the first character after the '<' lets you know
what kind of assertion is being created.

> Unfortunately, though,
>
> <identchar - digit>
>
> would be ambiguous, and/or wrong. Could allow whitespace there if we
> picked an explicit "this is rule" character. Did we remove "this is
> string"?

I didn't recall seeing anything that removed "this is string", so it's
currently implemented in PGE. It's kind of a nice shortcut:

<bracketed: []()>

but it would be no real problem to eliminate it and go
strictly with:

<bracketed('[]()')>

"This is rule" is currently whitespace, whatever follows is taken to be
a pattern.

But let me know what you decide so I can make the appropriate
changes. :-)

Pm

Patrick R. Michaud

Nov 21, 2005, 12:25:20 PM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 09:02:57AM -0800, Larry Wall wrote:
> : There's also <sp>, unless someone redefines the <sp> subrule.
>
> But you can't use <sp> in a character class. Well, that is, unless
> you write it:
>
> <+[ a..z ]+<sp>>
>
> or some such. Maybe that's good enough.

Er, that's now <+[ a..z ]+sp>, unless you're now changing it back.

> : And in the general case that's a slightly more expensive mechanism
> : to get a space (it involves at least a subrule lookup). Perhaps
> : we could also create a visible meta sequence for it, in the same
> : way that we have visible metas for \e, \f, \r, \t. But I have
> : no idea what letter we might use there.
>
> Something to be said for \_ in that regard.

Yes, I thought of \_ but mentally I still have trouble
classifying "_" along with the alphabetics -- '_' looks more
like punctuation to me. And in general we use backslashes
in front of metacharacters to remove their meta meaning
(or when we aren't sure if a character has a meta meaning),
so that \_ somehow seems like it ought to be a literal
underscore, guarding against the possibility that the unescaped
underscore has a meta meaning. (And yes, I can shoot
holes in this line of thinking along with everyone else.)

Whatever shortcuts we introduce, I'll be happy if we can just
rule that backslash+space (i.e., "\ ") is a literal space
character -- i.e., keeping the principle that placing a backslash
in front of a metacharacter removes that character's "meta"
behavior.
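
Python's verbose mode already follows this principle, which makes it a convenient illustration (it is an analogy, not Perl 6 behavior): unescaped whitespace in the pattern is ignored, and backslash-space is a literal space.

```python
import re

# Three spaces in the pattern, but only the escaped one is significant.
pattern = re.compile(r"foo \  bar", re.VERBOSE)

print(
    bool(pattern.fullmatch("foo bar")),
    bool(pattern.fullmatch("foobar")),
)
```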

> I dunno. If «...» in ordinary code does shell quoting, maybe «...» in
> rules does filename globbing or some such. I can see some issues with
> anchoring semantics. Makes more sense on a string as a whole, but maybe
> can anchor on element boundaries if used on a list of filenames.
> I suppose one could even go as far as
>
> rule jpeg :i « *.jp{e,}g »
>
> or whatever the right glob syntax is.

Since we already have :perl5, I'd think that we'd want globbing
to be something like

rule jpeg :i :glob /*.jp{e,}g/

or, for something intra-rule-ish:

m :w / mv (:glob *.c)+ <dir> /

And perhaps we'd want a general form for specifying other
pattern syntaxes; i.e., :perl5 and :glob are shortcuts for
:syntax('perl5') and :syntax('glob') or something like that.
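
The idea of a glob as just another pattern syntax can be sketched in Python, where fnmatch.translate turns a glob into a regex. The helpers compile_globs and matches_any are hypothetical, and because fnmatch has no brace expansion the {e,} alternation is written out by hand as two globs:

```python
import fnmatch
import re


def compile_globs(globs, ignore_case=False):
    """Compile shell-style globs into regexes (a stand-in for ':glob')."""
    flags = re.IGNORECASE if ignore_case else 0
    return [re.compile(fnmatch.translate(g), flags) for g in globs]


def matches_any(name, compiled):
    """True if the file name matches at least one compiled glob."""
    return any(rx.match(name) for rx in compiled)


# '*.jp{e,}g' expanded manually; ignore_case plays the role of ':i'.
jpeg = compile_globs(["*.jpeg", "*.jpg"], ignore_case=True)
print(matches_any("Photo.JPG", jpeg), matches_any("photo.png", jpeg))
```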

Pm

Patrick R. Michaud

Nov 21, 2005, 2:31:06 PM
to Ruud H.G. van Tol, perl6-l...@perl.org
On Mon, Nov 21, 2005 at 07:57:59PM +0100, Ruud H.G. van Tol wrote:
>
> There is a "[[:alpha:][:digit:]" and a "[[:alpha:][:digit]]" on the
> A5-page.

Now fixed.

> > Besides, you have to be able to distinguish
> > s/^/foo/ from s/$/foo/.
>
> 's/$/foo/' becomes 's/<after .*>/foo/'
> <g>

Uh, no, because <after> is still a zero width assertion. :-)

Pm

Patrick R. Michaud

Nov 21, 2005, 5:27:56 PM
to Ruud H.G. van Tol, perl6-l...@perl.org
On Mon, Nov 21, 2005 at 11:19:48PM +0100, Ruud H.G. van Tol wrote:
> Patrick R. Michaud:
>
> >> 's/$/foo/' becomes 's/<after .*>/foo/'
> >> <g>
> >
> > Uh, no, because <after> is still a zero width assertion. :-)
>
> That's why I chose it. It is not at the end-of-string?

Because ".*" matches "", /<after .*>/ would be true at
every position in the string, including the beginning,
and this is where "foo" would be substituted.

Pm

Patrick R. Michaud

Nov 21, 2005, 9:40:59 PM
to Ruud H.G. van Tol, perl6-l...@perl.org
On Tue, Nov 22, 2005 at 01:09:40AM +0100, Ruud H.G. van Tol wrote:
> >>>> 's/$/foo/' becomes 's/<after .*>/foo/'
> >>>
> >>> Uh, no, because <after> is still a zero width assertion. :-)
> >>
> >> That's why I chose it. It is not at the end-of-string?
> >
> > Because ".*" matches "", /<after .*>/ would be true at
> > every position in the string, including the beginning,
> > and this is where "foo" would be substituted.
>
> I expected greediness, also because <after .*?> could behave non-greedy.
> ...

> But why does <after .*> behave non-greedy?

I think you may be misreading what <after .*> does -- it's a lookbehind
assertion. An assertion such as <after pattern> attempts to match
pattern to the sequence immediately preceding the current match position.
It does not mean "skip over pattern and then match whatever comes
afterwards".

The greediness of the .* subpattern in <after .*> doesn't affect
things at all -- <after .*> is still a zero-width assertion.

Since ".*" can match at every position, <after .*> will be
a successful zero-width match (i.e., a null string) at every
position in the target string, including the beginning.

So, s/<after .*>/foo/ matches the first null string it finds
-- the one at the beginning of the string -- and replaces it
with "foo". It's the same as if you had written s/<null>/foo/,
since <after .*> and <null> will both end up matching exactly
the same (i.e., a zero-width string at any position).
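
The same effect can be reproduced in Python as a sketch. Python's re only allows fixed-width lookbehind, so the empty pattern stands in for <after .*> here, on the assumption stated above that the two match at exactly the same positions; the negative lookahead is the Perl 5 form from earlier in the thread:

```python
import re

text = "abc"

# A pattern that can match the empty string succeeds at every position.
empty_positions = [m.start() for m in re.finditer(r"", text)]

# A single substitution therefore lands at the first position: the start.
first_empty = re.sub(r"", "foo", text, count=1)

# "Not followed by any character" is only true at the end of the string.
at_end = re.sub(r"(?!.)", "foo", text, count=1, flags=re.DOTALL)

print(empty_positions, first_empty, at_end)
```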

If this still doesn't make any sense, contact me off-list and
I'll try and explain it there.

Pm

Larry Wall

Nov 22, 2005, 10:52:24 AM
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 11:25:20AM -0600, Patrick R. Michaud wrote:

: On Mon, Nov 21, 2005 at 09:02:57AM -0800, Larry Wall wrote:
: > : There's also <sp>, unless someone redefines the <sp> subrule.
: >
: > But you can't use <sp> in a character class. Well, that is, unless
: > you write it:
: >
: > <+[ a..z ]+<sp>>
: >
: > or some such. Maybe that's good enough.
:
: Er, that's now <+[ a..z ]+sp>, unless you're now changing it back.

No, just me going senile.

: > : And in the general case that's a slightly more expensive mechanism
: > : to get a space (it involves at least a subrule lookup). Perhaps
: > : we could also create a visible meta sequence for it, in the same
: > : way that we have visible metas for \e, \f, \r, \t. But I have
: > : no idea what letter we might use there.
: >
: > Something to be said for \_ in that regard.
:
: Yes, I thought of \_ but mentally I still have trouble
: classifying "_" along with the alphabetics -- '_' looks more
: like punctuation to me. And in general we use backslashes
: in front of metacharacters to remove their meta meaning
: (or when we aren't sure if a character has a meta meaning),
: so that \_ somehow seems like it ought to be a literal
: underscore, guarding against the possibility that the unescaped
: underscore has a meta meaning. (And yes, I can shoot
: holes in this line of thinking along with everyone else.)

I think we'll leave both _ and \_ meaning the same thing, just to avoid
that confusion path--I've seen people backwhacking anything remotely
resembling punctuation just in case it's a metacharacter, and if they
are confused about _, they might backwhack it. More to the point,
I think <sp> and +sp are about the right Huffman length, given that
matching a single space is usually wrong. You usually want \s or \s*.

: Whatever shortcuts we introduce, I'll be happy if we can just
: rule that backslash+space (i.e., "\ ") is a literal space
: character -- i.e., keeping the principle that placing a backslash
: in front of a metacharacter removes that character's "meta"
: behavior.

Yes, that will be a space.
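
The same principle can be seen in Python's verbose regex mode, offered
here only as an analogy to whitespace-insensitive rules and not as
Perl 6 syntax: unescaped whitespace in the pattern is ignored, while a
backslashed space is a literal space. The names `ignored` and `literal`
are illustrative.

```python
import re

# In verbose mode, unescaped whitespace in the pattern is ignored...
ignored = re.compile(r"a b", re.VERBOSE)

# ...while a backslash in front of the space makes it a literal space.
literal = re.compile(r"a\ b", re.VERBOSE)

print(bool(ignored.fullmatch("ab")))   # pattern is effectively "ab"
print(bool(literal.fullmatch("a b")))  # pattern requires the space
```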

: > I dunno. If «...» in ordinary code does shell quoting, maybe «...» in
: > rules does filename globbing or some such. I can see some issues with
: > anchoring semantics. Makes more sense on a string as a whole, but maybe
: > can anchor on element boundaries if used on a list of filenames.
: > I suppose one could even go as far as
: >
: > rule jpeg :i « *.jp{e,}g »
: >
: > or whatever the right glob syntax is.
:
: Since we already have :perl5, I'd think that we'd want globbing
: to be something like
:
: rule jpeg :i :glob /*.jp{e,}g/
:
: or, for something intra-rule-ish:
:
: m :w / mv (:glob *.c)+ <dir> /

Yep, that's what I decided in my other message that was thinking about
using < ... > for word boundaries and << ... >> for capturing $<>.
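
To make the idea of a glob-flavored pattern concrete, here is a minimal
Python sketch of what a case-insensitive glob modifier amounts to:
translating glob syntax into an ordinary regex before matching. It
assumes Python's standard fnmatch module, which has no brace expansion,
so *.jp{e,}g is spelled out as two globs. `compile_globs` and `is_jpeg`
are hypothetical names, not anything proposed in this thread.

```python
import fnmatch
import re

def compile_globs(*globs):
    """Return a predicate that is true when a name matches any of the globs."""
    # fnmatch.translate turns shell-glob syntax into regex source text;
    # IGNORECASE plays the role of the :i modifier.
    regexes = [re.compile(fnmatch.translate(g), re.IGNORECASE) for g in globs]
    return lambda name: any(r.match(name) for r in regexes)

# Spelled-out equivalent of the glob *.jp{e,}g
is_jpeg = compile_globs("*.jpg", "*.jpeg")

print(is_jpeg("PHOTO.JPG"), is_jpeg("notes.txt"))
```

Note that the translated regex is anchored to the whole string, which is
the simple answer to the anchoring question raised above.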

: And perhaps we'd want a general form for specifying other
: pattern syntaxes; i.e., :perl5 and :glob are shortcuts for
: :syntax('perl5') and :syntax('glob') or something like that.

Maybe. Or maybe it's enough that there are syntactic categories for
adding rule modifiers. Doesn't seem like you'd want to parameterize
the current language very often.

Larry

Patrick R. Michaud

Nov 22, 2005, 11:14:52 AM11/22/05
to perl6-l...@perl.org
On Tue, Nov 22, 2005 at 07:52:24AM -0800, Larry Wall wrote:
>
> I think we'll leave both _ and \_ meaning the same thing, just to avoid
> that confusion path [...]

Yay!

> : Whatever shortcuts we introduce, I'll be happy if we can just
> : rule that backslash+space (i.e., "\ ") is a literal space
> : character -- i.e., keeping the principle that placing a backslash
> : in front of a metacharacter removes that character's "meta"
> : behavior.
>
> Yes, that will be a space.

Yay!

> : Since we already have :perl5, I'd think that we'd want globbing
> : to be something like
> : rule jpeg :i :glob /*.jp{e,}g/
> : or, for something intra-rule-ish:
> : m :w / mv (:glob *.c)+ <dir> /
>
> Yep, that's what I decided in my other message that was thinking about
> using < ... > for word boundaries and << ... >> for capturing $<>.

Yay! (Our messages on this crossed in the mail; mine was moderated for
some reason but that's been corrected.)

> : And perhaps we'd want a general form for specifying other
> : pattern syntaxes; i.e., :perl5 and :glob are shortcuts for
> : :syntax('perl5') and :syntax('glob') or something like that.
>
> Maybe. Or maybe it's enough that there are syntactic categories for
> adding rule modifiers. Doesn't seem like you'd want to parameterize
> the current language very often.

At least within PGE, I'm starting to come across the situation
where each application and host language wants its own slight variations
of the regular expression syntax (for compatibility reasons).
And I figured that since we (conjecturally) have C<:lang('PIR')>,
C<:lang('Python')> and C<:lang('TCL')> to indicate the language
to be used for the closures within a rule, it might be nice to
have a similar parameterized modifier for the pattern syntax
itself.

I was also thinking that one of the tricky parts to custom rule
modifiers such as :perl5 and :glob is that they actually change
the parsing for whatever follows, so it might be nice to have
a parameterized form to hook into rather than defining a custom
modifier for each syntax variant. But on thinking about it
further from an implementation perspective I guess it all comes
out the same anyway...

Pm

Damian Conway

Nov 22, 2005, 4:19:04 AM11/22/05
to Patrick R. Michaud, perl6-l...@perl.org
Patrick wrote:

> Since we already have :perl5, I'd think that we'd want globbing
> to be something like
>
> rule jpeg :i :glob /*.jp{e,}g/
>
> or, for something intra-rule-ish:
>
> m :w / mv (:glob *.c)+ <dir> /

Hear! Hear!

> And perhaps we'd want a general form for specifying other
> pattern syntaxes; i.e., :perl5 and :glob are shortcuts for
> :syntax('perl5') and :syntax('glob') or something like that.

Agreed.

Damian

Larry Wall

Nov 22, 2005, 12:23:01 PM11/22/05
to perl6-l...@perl.org
On Tue, Nov 22, 2005 at 08:19:04PM +1100, Damian Conway wrote:
: >And perhaps we'd want a general form for specifying other
: >pattern syntaxes; i.e., :perl5 and :glob are shortcuts for
: >:syntax('perl5') and :syntax('glob') or something like that.
:
: Agreed.

But the language in the following lexical scope is a constant, so what can
:syntax($foo) possibly mean? [Wait, this is Damian I'm talking to.]
Never mind, don't answer that...

And there aren't that many regexish languages anyway. So I think :syntax
is relatively useless except for documentation, and in practice people
will almost always omit it, which makes it even less useful, and pretty
nearly kicks it over into the category of multiplied entities for me.

Larry

Dave Whipp

Nov 22, 2005, 12:46:59 PM11/22/05
to perl6-l...@perl.org
Larry Wall wrote:

> And there aren't that many regexish languages anyway. So I think :syntax
> is relatively useless except for documentation, and in practice people
> will almost always omit it, which makes it even less useful, and pretty
> nearly kicks it over into the category of multiplied entities for me.

It's surprising how many are out there. Even if we ignore the various
dialects of standard rexen, we can find interesting examples such as
PSL, a language for specifying temporal assertions, for hardware design:
http://www.project-veripage.com/psl_tutorial_5.php. Whether one would
want to fold this syntax into a C<rule> is a different question.

There are actually a number of competing languages in this space. E.g.
http://www.pslsugar.org/papers/pslandsva.pdf.

Larry Wall

Nov 22, 2005, 1:30:20 PM11/22/05
to perl6-l...@perl.org
On Tue, Nov 22, 2005 at 09:46:59AM -0800, Dave Whipp wrote:
: Larry Wall wrote:
:
: >And there aren't that many regexish languages anyway. So I think :syntax
: >is relatively useless except for documentation, and in practice people
: >will almost always omit it, which makes it even less useful, and pretty
: >nearly kicks it over into the category of multiplied entities for me.
:
: It's surprising how many are out there.

We can certainly add a :syntax() modifier as easily as a :foolang modifier,
if we decide at some point we really need one, or if PGE could make good
use of it even if Perl 6 doesn't want it.

Larry

Patrick R. Michaud

Nov 22, 2005, 1:39:49 PM11/22/05
to perl6-l...@perl.org

I'm agreeing with Larry on this one -- let's wait to decide this
until we actually feel like we need it.

Pm

Patrick R. Michaud

Nov 22, 2005, 1:48:39 PM11/22/05
to perl6-l...@perl.org
On Mon, Nov 21, 2005 at 09:02:57AM -0800, Larry Wall wrote:
> On Sun, Nov 20, 2005 at 10:27:17AM -0600, Patrick R. Michaud wrote:
> : On Sat, Nov 19, 2005 at 06:32:17PM -0800, Larry Wall wrote:
> : > We already have, from A5, \x[0a;0d], so you can supposedly say
> : > "\x[123a;123b;123c]"
> :
> : Hmm, I hadn't caught that particular syntax in A05. AFAIK it's not
> : in S05, so I should probably add it, or whatever syntax we end up
> : adopting.
>
> Yes.

Out of curiosity (and so I can update S05 and PGE), what syntax
are we adopting? Is it semicolon, comma, space, any combination of the
three, or ...?

Pm

Larry Wall

Nov 22, 2005, 2:03:35 PM11/22/05
to perl6-l...@perl.org
On Tue, Nov 22, 2005 at 12:48:39PM -0600, Patrick R. Michaud wrote:

S02.pod currently has it as comma.
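
As a sketch of what the comma form means, the following Python function
expands a comma-separated list of hex code points into the string it
denotes. It models only the semantics of the escape, not how any Perl 6
implementation parses it, and `expand_hex_list` is a hypothetical helper
name.

```python
def expand_hex_list(spec):
    """Turn a string such as '123a,123b,123c' into the characters it names."""
    # Each comma-separated field is one hexadecimal code point.
    return "".join(chr(int(code.strip(), 16)) for code in spec.split(","))

# Mirrors "\x[123a,123b,123c]": three characters, no separators between them.
three_chars = expand_hex_list("123a,123b,123c")
print(len(three_chars))
```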

Larry

Damian Conway

Nov 22, 2005, 3:51:32 PM11/22/05
to Larry Wall, perl6-l...@perl.org
Larry wrote:

> But the language in the following lexical scope is a constant, so what can
> :syntax($foo) possibly mean? [Wait, this is Damian I'm talking to.]
> Never mind, don't answer that...

Too late! ;-)

Regex syntaxes already are a twisty maze of variations, mostly alike. I
can easily envisage Perl users occasionally needing/wanting/using
patterns which are any of:

:syntax<POSIX>
:syntax<grep>
:syntax<egrep>
:syntax<vim>
:syntax<Snobol>
:syntax<Google>

Not just because people are used to different syntaxes, but also because
programs will want to accept search patterns in different (generally: more
restrictive) syntaxes so as to be able to interpolate them safely:

    use Regex::Google;

    for =<> :prompt<Find:> -> $search {
        for @texts {
            say if m:syntax<Google>/$search/;
        }
    }
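
The safe-interpolation point can be sketched in Python. This is an
assumption-laden stand-in, not the hypothetical Regex::Google module: the
"restricted syntax" here is just whitespace-separated words that must all
occur, and every regex metacharacter the user types is escaped so that it
can only match itself. `search_terms_to_regex` and `find` are made-up
names.

```python
import re

def search_terms_to_regex(query):
    """Compile a query of plain words; every regex metacharacter is escaped."""
    terms = [re.escape(word) for word in query.split()]
    # One lookahead per term: all terms must occur somewhere in the text,
    # in any order.
    source = "".join(f"(?=.*{term})" for term in terms)
    return re.compile(source, re.IGNORECASE | re.DOTALL)

def find(query, texts):
    """Return the texts that contain every term of the query."""
    pattern = search_terms_to_regex(query)
    return [text for text in texts if pattern.match(text)]

texts = ["Perl 6 rules", "a.b", "axb"]
print(find("a.b", texts))  # the dot is literal here, not a wildcard
```

Because the user's input is translated rather than interpolated raw, a
query such as ".*" searches for those two characters instead of matching
everything.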


> And there aren't that many regexish languages anyway.

That depends on how broadly you define regexish. Search is a *very* common
activity and people are (re-)inventing notations for it all the time.

Damian

Luke Palmer

Nov 23, 2005, 3:24:23 AM11/23/05
to dam...@conway.org, Larry Wall, perl6-l...@perl.org
On 11/22/05, Damian Conway <dam...@conway.org> wrote:
> :syntax<POSIX>
> :syntax<grep>
> :syntax<egrep>
> :syntax<vim>
> :syntax<Snobol>
> :syntax<Google>

Aren't we providing an interface to define your own regex modifiers?
All of these can easily be mapped into Perl 6 patterns, so...

Modules welcome! ;-)

Luke

Damian Conway

Nov 23, 2005, 4:33:16 AM11/23/05
to Luke Palmer, perl6-l...@perl.org
Luke wrote:

> On 11/22/05, Damian Conway <dam...@conway.org> wrote:
>
>> :syntax<POSIX>
>> :syntax<grep>
>> :syntax<egrep>
>> :syntax<vim>
>> :syntax<Snobol>
>> :syntax<Google>
>
>
> Aren't we providing an interface to define your own regex modifiers?

Sure. But it'd lead to much less namespace pollution and much greater
readability if there were only one standard modifier that subsumed all future
possibilities.

Damian

Luke Palmer

Nov 23, 2005, 10:49:29 AM11/23/05
to dam...@conway.org, perl6-l...@perl.org

Okay, I don't think this is an important part of the design of the
language, so I'll not fuss over it. However, I think it's a good case
study that covers some important issues.

Something I've learned from Haskell: if you have the following three things:

  * Fine-grained control over your lexical environment
  * A compiler that tells you when you're referring to something
    ambiguously, rather than just having one symbol hide the other
  * A way in all cases to refer to an export that you have *not*
    imported, using some fully-qualified form

Then namespace pollution is not an issue... at all.

I don't believe we have the last of those for regex modifiers. We
should get it.

Now, by grouping all these different modules under a standard :syntax,
the following things follow:

  * The string has to be evaluated at compile-time; i.e.
    $syn = "vim"; rx:syntax($syn)/.../ is not legal. It takes a fair
    amount of maturity in the workings of Perl to understand why that
    doesn't work.
  * Various :syntax modifiers will probably do very different things
    inside the regex (one approach is not likely to work for everyone),
    so our common interface will add little to nothing over the standard
    modifier interface, other than another name to sift through in the
    docs.
  * We have to provide our own registry for these, which maps strings
    onto implementations, rather than using the already existing symbol
    table. By reinventing this registry, we lose fully-qualified
    referrability to these symbols, which was one of the requirements
    for avoiding namespace pollution. It seems we've made that problem
    worse... but see below: :-)
  * The :syntax modifiers cannot take arguments, further worsening
    the namespace pollution problem rather than helping it. Or... if
    they can take arguments, it would be like this:
    :syntax['POSIX', :charclasses], which means reinventing the calling
    conventions already in place in much of the language (and losing
    some features in the process of reinventing, as usual), and losing
    any type information in the process. Also it's butt ugly.

So... those have some strong language in there. I don't really want
to go back and sugar them up, but pretend I did.

Anyway, I think the biggest point is that when you substitute a string
for a first-class object, you have to emulate (poorly) all the fancy
mechanisms that the language designers have meticulously crafted.
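
A small Python sketch of the contrast, under the assumption that each
"syntax" is just a function that compiles a pattern. All names here
(`glob_syntax`, `literal_syntax`, `SYNTAXES`, `compile_by_name`) are
invented for illustration and are not part of any proposed Perl 6
interface.

```python
import fnmatch
import re

def glob_syntax(pattern, ignore_case=False):
    """A syntax as a first-class function with its own named arguments."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(fnmatch.translate(pattern), flags)

def literal_syntax(pattern):
    """A syntax in which every character matches only itself."""
    return re.compile(re.escape(pattern))

# Style 1: one generic entry point plus a private string-keyed registry.
# Per-syntax options have to be tunnelled through as untyped extras.
SYNTAXES = {"glob": glob_syntax, "literal": literal_syntax}

def compile_by_name(syntax, pattern, **options):
    # A misspelled syntax name surfaces only here, as a KeyError at call time.
    return SYNTAXES[syntax](pattern, **options)

by_name = compile_by_name("glob", "*.c")

# Style 2: refer to the implementation through the ordinary namespace,
# so imports, qualified names, and argument checking all apply as usual.
direct = glob_syntax("*.C", ignore_case=True)
```

The registry in style 1 duplicates what the module system already does
for style 2, which is the emulation cost described above.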

Luke
