\033 - octal (p5; deprecated but allowed in p6?)
\o33 - octal (p5)
\x1b - hex (p5)
\d123 - decimal (?)
\b1001 - binary (?)
and if so, if these are allowed too:
\o{777} - (?)
\x{1b} - "wide" hex (p5)
\d{123} - (?)
\b{1001} - (?)
Only four of these nine constructs are allowed in Perl5.
Note that \b conflicts with backspace. I'd rather keep backspace than
binary, personally; I have yet to feel the need to call out a char in
binary. :-) Or we can make it dependent on the trailing digits, or
require the brackets, or require backspace to be spelt differently.
But I think we'd definitely like to introduce \d.
There is also the question of what the bracketed format does. "Wide"
chars, e.g. for Unicode, seem appropriate only in hex. But it would
seem useful to allow a bracketed form for the others that prevents
ambiguities:
"\o164" ne "\o{16}4"
"\d100" ne "\d{10}0"
Whether that means you can actually specify wide chars in \o, \d, and
\b or it's just a disambiguification of the Latin-1 case is open to
question.
MikeL
*Not to be confused with an eigendisambiguification, of course.
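To make the ambiguity concrete, here is a small Python sketch of the two readings. Python is used only because the Perl 6 escapes above are still proposals; the variable names are made up for illustration, and each line models one of the strings being compared rather than any real Perl behaviour.

```python
# Unbracketed form: the escape swallows every digit that follows it.
greedy_octal = chr(0o164)              # models "\o164"
greedy_decimal = chr(100)              # models "\d100"

# Bracketed form: the number ends at the closing bracket, and the
# trailing digit is an ordinary literal character.
bracketed_octal = chr(0o16) + "4"      # models "\o{16}4"
bracketed_decimal = chr(10) + "0"      # models "\d{10}0"

print(repr(greedy_octal), repr(bracketed_octal))
print(repr(greedy_decimal), repr(bracketed_decimal))
```

The greedy readings give one character each, the bracketed ones give two, which is the whole point of the "ne" comparisons.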
Our numeric literals use # for radix stuff. So perhaps we could use "\#..."
to introduce explicit codings:
"\#d13"
"\#h0d"
"\#b1101"
"\#{ 1<<6 - 20 * 2 - 9#1:2 }"
would all be synonyms!
Dave.
ps. how did this thread migrate from p6d to p6l?
I think it's disallowed.
: \o33 - octal (p5)
: \x1b - hex (p5)
: \d123 - decimal (?)
: \b1001 - binary (?)
Can't really have \d and \b if they keep their current regex meanings.
I think the general form is:
\0o33 - octal
\0x1b - hex
\0d123 - decimal
\0b1001 - binary
\x and \o are then just shortcuts.
: and if so, if these are allowed too:
:
: \o{777} - (?)
: \x{1b} - "wide" hex (p5)
: \d{123} - (?)
: \b{1001} - (?)
The general form could be
\0o[33] - octal
\0x[1b] - hex
\0d[123] - decimal
\0b[1001] - binary
Or it could be
\c[0o33] - octal
\c[0x1b] - hex
\c[0d123] - decimal
\c[0b1001] - binary
since \c is taking over \N's (rather ill-defined) duties.
: Note that \b conflicts with backspace. I'd rather keep backspace than
: binary, personally; I have yet to feel the need to call out a char in
: binary. :-) Or we can make it dependent on the trailing digits, or
: require the brackets, or require backspace to be spelt differently.
\c[^H], for instance. We can overload the \c notation to our heart's
desire, as long as we don't conflict with its use for named characters:
\c[GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI]
: But I think we'd definitely like to introduce \d.
Can't, unless we change \d to <digit> in regexen.
: There is also the question of what the bracketed format does. "Wide"
: chars, e.g. for Unicode, seem appropriate only in hex. But it would
: seem useful to allow a bracketed form for the others that prevents
: ambiguities:
:
: "\o164" ne "\o{16}4"
: "\d100" ne "\d{10}0"
:
: Whether that means you can actually specify wide chars in \o, \d, and
: \b or it's just a disambiguification of the Latin-1 case is open to
: question.
There ain't no such thing as a "wide" character. \xff is exactly
the same character as \x[ff]. A character in Perl is an abstract
codepoint number--how it's represented is of no concern to the
programmer (though it might be of concern to any interface to the
outside world, of course). Do not think of Perl 6 strings as arrays
of bytes (except when they are (and probably not even then...)).
Larry
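For what it's worth, Python 3 strings follow the same "abstract codepoint" model, so they make a handy illustration (this is only an analogy, not a statement about how Perl 6 will implement anything). The character is the same however it is spelled; bytes only appear once an encoding is chosen at the interface.

```python
# Three spellings of codepoint 0xFF; all produce the same one-character string.
short_hex = "\xff"
long_hex = "\u00ff"
from_number = chr(0xFF)

same_character = short_hex == long_hex == from_number

# The representation only matters when talking to the outside world.
as_latin1 = short_hex.encode("latin-1")   # one byte
as_utf8 = short_hex.encode("utf-8")       # two bytes

print(same_character, len(short_hex), as_latin1, as_utf8)
```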
> : But I think we'd definitely like to introduce \d.
>
> Can't, unless we change \d to <digit> in regexen.
Which we ought to be very wary of, given how very frequently it's
used in regexes.
Damian
I like that *a lot*, especially the change to square brackets.
> \c[^H], for instance. We can overload the \c notation to our heart's
> desire, as long as we don't conflict with its use for named characters:
.... and that ...
> There ain't no such thing as a "wide" character. \xff is exactly
> the same character as \x[ff].
.... and that, thank goodness.
I think that solves all the problems we're having. We change \c to
have more flexible meanings, with \0o, \0x, \0d, \0b, \o, \x as
shortcuts. Boom, we're done. Thanks!
MikeL
How far can we go with this \c thing? How about:
print "\c[72, 101, 108, 108, 111]";
will that print "Hello"?
Dave.
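As a sanity check on the numbers, here is what that list decodes to if each entry is read as a decimal codepoint. This is a Python sketch of the idea only; the helper name codepoints_to_string is invented for the example.

```python
def codepoints_to_string(codepoints):
    """Join a sequence of integer codepoints into one string."""
    return "".join(chr(n) for n in codepoints)

greeting = codepoints_to_string([72, 101, 108, 108, 111])
print(greeting)
```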
> \0o33 - octal
> \0x1b - hex
> \0d123 - decimal
> \0b1001 - binary
> \x and \o are then just shortcuts.
Can we please also have \0 as a shortcut for \0x0?
> \c[^H], for instance. We can overload the \c notation to our heart's
> desire, as long as we don't conflict with its use for named characters:
>
> \c[GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI]
Very Cool. (BTW, for those that don't follow Unicode, this means that
everything matching /^[^A-Z ]$/ is fair game for us; Unicode limits
character names to that to minimize chicken-and-egg problems. We
/probably/ shouldn't take anything in /^[A-Za-z ]$/, to allow people to
say the much more readable "\c[Greek Capital Letter Omega with Pepperoni
and Pineapple]".)
> : There is also the question of what the bracketed format does. "Wide"
> : chars, e.g. for Unicode, seem appropriate only in hex. But it would
> : seem useful to allow a bracketed form for the others that prevents
> : ambiguities:
> :
> : "\o164" ne "\o{16}4"
> : "\d100" ne "\d{10}0"
> :
> : Whether that means you can actually specify wide chars in \o, \d, and
> : \b or it's just a disambiguification of the Latin-1 case is open to
> : question.
>
> There ain't no such thing as a "wide" character. \xff is exactly
> the same character as \x[ff].
Which means that the only way to get a string with a literal 0xFF byte
in it is with qq:u1[\xFF]? (Larry, I don't know that this has been
mentioned before: is that right?) chr:u1(0xFF) might do it too, but
we're getting ahead of ourselves.
Also, an annoying corner case: is "\0x1ff" eq "\0x[1f]f", or is it eq
"\0x[1ff]"? What about other bases? Is "\0x1x" eq "\0x[1]", or is it
eq "\0x[1x]" (IE illegal). (Now that I put those three questions
together, the only reasonable answer seems to be that the number ends in
the last place it's valid to end if you don't use explicit brackets.)
(BTW, in HTML and XML, numeric character escapes are decimal by default;
you have to add an x after the # for hex. In Windows and several other
OSes (I think; I like to play with Unicode but have little actual use for
it), ALT-0nnn is spelt in decimal only. Decimal Unicode ordinals are
fundamentally flawed (since blocks always start on nice even hex numbers,
but ugly decimal ones), but useful anyway.)
-=- James Mastros
> On 12/04/2002 3:21 PM, Larry Wall wrote:
>> \x and \o are then just shortcuts.
> Can we please also have \0 as a shortcut for \0x0?
\0 in addition to \x, meaning the same thing? I think that would get
us back to where we were with octal, wouldn't it? I'm not real keen on
leading zero meaning anything, personally... :-P
>> There ain't no such thing as a "wide" character. \xff is exactly
>> the same character as \x[ff].
> Which means that the only way to get a string with a literal 0xFF byte
> in it is with qq:u1[\xFF]? (Larry, I don't know that this has been
> mentioned before: is that right?) chr:u1(0xFF) might do it too, but
> we're getting ahead of ourselves.
Hmm... does this matter? I'm a bit rusty on my Unicode these days, but
I was assuming that \xFF and \x00FF always pointed to the same
character, and that you in fact _don't_ have the ability to put
individual bytes in a string, because Perl is deciding how to place the
characters for you (how long they should be, etc.) So if you wanted
more explicit control, you'd use C<pack>.
> Also, an annoying corner case: is "\0x1ff" eq "\0x[1f]f", or is it eq
> "\0x[1ff]"? What about other bases? Is "\0x1x" eq "\0x[1]", or is it
> eq "\0x[1x]" (IE illegal). (Now that I put those three questions
> together, the only reasonable answer seems to be that the number ends
> in the last place it's valid to end if you don't use explicit
> brackets.)
Yeah, my guess is that it's as you say... it goes till it can't goes no
more, but never gives an error (well, maybe for "\0xz", where there are
zero valid digits?) But I would suspect that the bracketed form is
*strongly* recommended. At least, that's what I plan on telling
people. :-)
Design team: If we're wrong on these, please correct. :-)
MikeL
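Here is a rough Python model of the rule as guessed above: without brackets the number runs as far as the digits stay valid, with brackets it is exactly what is inside them, and zero valid digits is an error. The function name decode_numeric_escape and the RADIX table are invented for this sketch, and the \0x / \0o / \0d / \0b forms are the proposals from this thread, not settled syntax.

```python
import re

# Radix letter -> (numeric base, regex character class of its valid digits).
RADIX = {
    "x": (16, "0-9a-fA-F"),
    "o": (8, "0-7"),
    "d": (10, "0-9"),
    "b": (2, "01"),
}

def decode_numeric_escape(text):
    r"""Decode one leading \0x, \0o, \0d or \0b escape.

    Returns (character, remaining_text). Raises ValueError when the
    escape has no valid digits, e.g. \0xz.
    """
    prefix = re.match(r"\\0([xodb])", text)
    if prefix is None:
        raise ValueError("not a numeric escape: %r" % text)
    base, digit_class = RADIX[prefix.group(1)]
    rest = text[prefix.end():]

    # Bracketed form: the number is exactly what sits between [ and ].
    number = re.match(r"\[([%s]+)\]" % digit_class, rest)
    if number is None:
        # Bare form: take the longest run of valid digits and stop there.
        number = re.match(r"([%s]+)" % digit_class, rest)
    if number is None:
        raise ValueError("no valid digits in escape: %r" % text)
    return chr(int(number.group(1), base)), rest[number.end():]

print(decode_numeric_escape(r"\0x1ff"))
print(decode_numeric_escape(r"\0x[1f]f"))
```

Under this model "\0x1ff" is one character (codepoint 0x1ff), while "\0x[1f]f" is codepoint 0x1f followed by a literal f, and "\0x1x" stops after the 1.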
Huh... having a comma-separated list to represent multiple characters.
I can't think of any problems with that, and it would be marginally
easier for some sequences...
Unless someone on the design team objects, I'd say let's go for it.
MikeL
\0 still means chr(0). I don't think there's much conflict with
the new \0x, \0o, \0b, and \0d, since \0 almost always occurs at the
end of a string, if anywhere.
: >>There ain't no such thing as a "wide" character. \xff is exactly
: >>the same character as \x[ff].
: >Which means that the only way to get a string with a literal 0xFF byte
: >in it is with qq:u1[\xFF]? (Larry, I don't know that this has been
: >mentioned before: is that right?) chr:u1(0xFF) might do it too, but
: >we're getting ahead of ourselves.
:
: Hmm... does this matter? I'm a bit rusty on my Unicode these days, but
: I was assuming that \xFF and \x00FF always pointed to the same
: character, and that you in fact _don't_ have the ability to put
: individual bytes in a string, because Perl is deciding how to place the
: characters for you (how long they should be, etc.) So if you wanted
: more explicit control, you'd use C<pack>.
A "byte" string is any string whose characters are all under 256. It's
up to an interface to coerce this to actual bytes if it needs them.
We'll presumably have something like "use bytes" that turns off all
multi-byte processing, in which case you have to deal with any UTF that
comes in by hand. But in general it'll be better if the interface coerces
to types like "str8", which is presumably pronounced "straight".
Don't ask me how str16 and str32 are pronounced. (But generally you should
be using utf16 instead of str16 in any event, unless your interface truly
doesn't know how to deal with surrogates.) In other words, str16 is
the name of the obsolescent UCS-2, and str32 is the name for UCS-4, which
is more or less the same as UTF-32, except that UTF-32 is not allowed to
use the bits above 0x10ffff.
So anyway, we've got all these types:
    str8    utf8
    str16   utf16
    str32   utf32
where the "str" version is essentially just a compact integer array. One could
alias str8 to "latin1" since the default coercion from Unicode to str8 would
have those semantics.
It's not clear exactly what the bare "str" type is. "Str" is obviously
the abstract string type, but "str" probably means the default C string
type for the current architecture/OS/locale/whatever. In other words,
it might be str8, or it might be utf8. Let's hope it's utf8, because
that will work forever, give or take an eon.
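A quick Python sketch of the distinctions drawn above, using Python's codecs as stand-ins: "latin-1" plays the part of the default coercion to str8, and "utf-16-le" shows where surrogates come in. The helper is_byte_string is an invented name, and none of this says how the Perl 6 types will actually behave.

```python
def is_byte_string(s):
    """True when every character's codepoint is below 256."""
    return all(ord(ch) < 256 for ch in s)

cafe = "caf\u00e9"        # every codepoint is below 256
omega = "\u03a9"          # GREEK CAPITAL LETTER OMEGA, codepoint 0x3A9
astral = "\U0001F600"     # a codepoint above 0xFFFF

# Coercing a "byte" string to actual bytes is one byte per character.
latin1_bytes = cafe.encode("latin-1")
# The same string as UTF-8 needs two bytes for the accented letter.
utf8_bytes = cafe.encode("utf-8")
# One character above 0xFFFF takes two 16-bit units (a surrogate pair),
# which is why a plain 16-bit array (UCS-2) cannot hold it.
utf16_units = len(astral.encode("utf-16-le")) // 2

print(is_byte_string(cafe), is_byte_string(omega))
print(len(latin1_bytes), len(utf8_bytes), utf16_units)
```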
: >Also, an annoying corner case: is "\0x1ff" eq "\0x[1f]f", or is it eq
: >"\0x[1ff]"? What about other bases? Is "\0x1x" eq "\0x[1]", or is it
: >eq "\0x[1x]" (IE illegal). (Now that I put those three questions
: >together, the only reasonable answer seems to be that the number ends
: >in the last place it's valid to end if you don't use explicit
: >brackets.)
:
: Yeah, my guess is that it's as you say... it goes till it can't goes no
: more, but never gives an error (well, maybe for "\0xz", where there are
: zero valid digits?) But I would suspect that the bracketed form is
: *strongly* recommended. At least, that's what I plan on telling
: people. :-)
Sounds good to me. Dwimming is wonderful, but so is dwissing.
Larry
> Huh... having a comma-separated list to represent multiple characters.
> I can't think of any problems with that, and it would be marginally
> easier for some sequences...
>
> Unless someone on the design team objects, I'd say let's go for it.
Larry was certainly in favour of it when he wrote A5
(see under http://search.cpan.org/perl6/apo/A05.pod#Backslash_Reform).
Except the separators he suggests are semicolons:
    Perl 5          Perl 6
    \x0a\x0d        \x[0a;0d]       # CRLF
    \x0a\x0d        \c[CR;LF]       # CRLF (conjectural)
Damian
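A Python sketch of how the semicolon-separated forms could be read. The function names hex_list and name_list and the SHORT_NAMES table are invented here; full Unicode names go through Python's real unicodedata.lookup, while the short names CR and LF are handled by the local table because they are not formal character names. Note that in this model 0a;0d decodes in the order written, i.e. LF (0x0A) first and then CR (0x0D).

```python
import unicodedata

# Hypothetical short names, assumed for this sketch only.
SHORT_NAMES = {"CR": "\r", "LF": "\n"}

def hex_list(body):
    """Decode the inside of a bracketed hex list such as 0a;0d."""
    return "".join(chr(int(item, 16)) for item in body.split(";"))

def name_list(body):
    """Decode the inside of a bracketed name list such as CR;LF."""
    chars = []
    for item in body.split(";"):
        name = item.strip()
        if name in SHORT_NAMES:
            chars.append(SHORT_NAMES[name])
        else:
            # Raises KeyError for names Unicode does not know.
            chars.append(unicodedata.lookup(name))
    return "".join(chars)

print(repr(hex_list("0a;0d")))
print(repr(name_list("CR;LF")))
```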
> On Thursday, December 5, 2002, at 02:11 AM, James Mastros wrote:
>
>> On 12/04/2002 3:21 PM, Larry Wall wrote:
>>
>>> \x and \o are then just shortcuts.
>>
>> Can we please also have \0 as a shortcut for \0x0?
>
> \0 in addition to \x, meaning the same thing? I think that would get
> us back to where we were with octal, wouldn't it? I'm not real keen
> on leading zero meaning anything, personally... :-P
You misinterpret. I meant \0 meaning the same as \c[NUL], IE the same
as chr(0), a null character. (I suppose I should have said \0x[0].)
>> Which means that the only way to get a string with a literal 0xFF
>> byte in it is with qq:u1[\xFF]? (Larry, I don't know that this has
>> been mentioned before: is that right?) chr:u1(0xFF) might do it too,
>> but we're getting ahead of ourselves.
>
> Hmm... does this matter?
Sorry. It does, in fact, not matter... momentarily stopped thinking in
terms of utf8 encoding being a completely transparent process.
I just had this thought - can I interpolate in there?
Something like
"\c[$(call_a_func())]"
and I interpolate the string returned by call_a_func() using whatever
interpolation system it finds itself in (so the same function could
create the literal control characters if it finds itself in "\c[...]", or
the pretty-printed version if it finds itself within "\\c[...]").
I'm not sure if this is useful - I think its only benefit is to save a
string eval occasionally, with the downside of being obscure, complicating
understanding (interpolation becomes run time rather than compile time if
it's non-constant), and possibly the implementation.
[No I'm not smoking anything. I'm drinking tea today. Maybe it's just side
effects of over-exposure to london.pm yesterday, or reading Damian's book
(If it moves^Wis a reference - bless it! :-))]
Nicholas Clark
PS Time for a second edition:
perl5.8.0 -lwe 'format FOO = ' -e '.' -e '$a = *FOO{FORMAT}; print ref $a'
FORMAT
> I just had this thought - can I interpolate in there?
>
> Something like
> "\c[$(call_a_func())]"
Why not just:
"$(chr call_a_func())"
???
Damian
Well, I was wondering if my function returned "CR", then
"\c[$(call_a_func())]" would mean that the "CR" gets run through the
\c[...] conversion and a single byte ("\r") is what ends up in the string.
Even if that's of little practical use, what does the language design team
say should happen when perl6's 1 level lexer finds something that looks like
an attempt to interpolate within one of the other double-quotish
constructions?
[should it (a) phone a friend, (b) ask the audience (c) 50/50 ?]
Q "Is that your final answer?" A "for this week" :-)
Nicholas Clark
--
INTERCAL better than perl? http://www.perl.org/advocacy/spoofathon/
>>Why not just:
>>
>> "$(chr call_a_func())"
>>
>>???
>
>
> Well, I was wondering if my function returned "CR", then
> "\c[$(call_a_func())]" would mean that the "CR" gets run through the
> \c[...] conversion and a single byte ("\r") is what ends up in the string.
I seriously doubt it. %-)
Damian
> >Well, I was wondering if my function returned "CR", then
> >"\c[$(call_a_func())]" would mean that the "CR" gets run through the
> >\c[...] conversion and a single byte ("\r") is what ends up in the string.
>
> I seriously doubt it. %-)
So what will perl6's "" parser do if it is presented with what appears to
be a $() interpolation sequence inside some other double quoting
construction?
1: Proceed silently (so treating "\c[$(call_a_func())]" as a request for
the character literally named '$(call_a_func())'), potentially returning
whatever warnings fall out of that second stage
2: Issue a nesting warning when it finds something that seems to be an
interpolative construction occurring within another construction, but
otherwise carry on
3: Treat it as a syntax error
I suspect that the answer is "yes" - ie all of the above, with a warning on
nesting interpolative constructions, which can be made fatal, and can be made
silent.
Nicholas Clark
--
Brainfuck better than perl? http://www.perl.org/advocacy/spoofathon/
You must remember that the Perl 6 parser is one-pass now. An
interpolating string has a rule set of its own, whose simplified
version might look like this:
    grammar interpolating_string {
        rule string { " <string_thing>* " }
        rule string_thing { \$\( <Perl::expression> \)
                          | \\c \[ .*? \]
                          | <character> }
        rule character { \\ \\ | \\ " | . }
    }
So I imagine that it would do your no. 1, as it's not looking for an
interpolating construct.
Luke
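The effect of that simplified rule set can be imitated with an ordinary regex alternation in Python. This is only a model of the three alternatives above (the TOKEN pattern and tokenize function are invented names, and the interpolation branch does not handle nested parentheses), but it shows the one-pass behaviour: once the scanner is inside \c[...], everything up to the closing bracket is taken as the name, and no interpolation is looked for.

```python
import re

TOKEN = re.compile(
    r"""
      \$\( (?P<expr>[^)]*) \)       # interpolation: dollar, parenthesised body
    | \\c\[ (?P<cname>.*?) \]       # named-character escape, taken as one unit
    | (?P<char> \\\\ | \\" | . )    # escaped backslash, escaped quote, any char
    """,
    re.VERBOSE | re.DOTALL,
)

def tokenize(body):
    """Split the inside of a double-quoted string into (kind, text) pairs."""
    return [(m.lastgroup, m.group(m.lastgroup)) for m in TOKEN.finditer(body)]

print(tokenize(r"a\c[$(f())]b"))
print(tokenize("x$(y)z"))
```

In the first call the "$(f())" text comes back as the body of the \c escape (kind "cname"), not as an interpolation, which matches option 1.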
It is? Are you sure?
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
A fuller version might look like this:
(Note that this is with only allowing [] for internal brackets and
without accounting for the "balanced-brackets are ok" rule (out of
curiosity, is that still allowed?))
    grammar Interpolating_String {
        rule string {
            $delim := <head>
            { $delim = %matching{$delim} || $delim }
            (<body($delim)>+)
            $delim
        }
        rule head {
            (") | qq (<special_delim>) (
        }
        rule special_delim {
            <-[\w\s]>
        }
        rule body(Str $delim) {
              \\ <backslashed_expr>
            | <interpolated_value>
            | <interpolated_variable>
            | <-[ \\\$\@\%\&$delim ]>*
        }
        rule interpolated_variable {
            <Perl::Expression::variable>
            <str_subscript>*
            <!before <[ \[\{\( ]> >
        }
        rule backslashed_expr {
              \\Q <!before \[>
            | <before \\<[qQ]>\[ >
              <Perl::Literal::Non_Interpolating_String::body(']')>
            | c \[ <string_set> \]
            | \: \[ <string_set> \]
            | x \[ <set <Perl::Literal::Hex::number>>+ \]
            | x <Perl::Literal::Hex::number>
            | <[UL]> \[ <Perl::Literal::Non_Interpolating_String::body(']')>
            | 0 <Perl::Literal::number>
            | 0 \[ <Perl::Literal::number> \]
            | <[ule]> .
            | .
        }
        rule string_set {
            [
                <!before <[:\]]>+ \] >
                <Perl::Literal::Non_Interpolating_String::body(';')>
            ]*
            <Perl::Literal::Non_Interpolating_String::body(']')>
        }
        rule set($item) {
            $item [ <before ;$item | \] ]
        }
        rule str_subscript {
            <!before \s|\Q>
            <Perl::Expression::subscript>
        }
        rule interpolated_value {
            <Perl::Expression::Variable::sigil>
            \( <Perl::Expression::comma> \)
        }
    }
> At 5:11 PM -0700 12/9/02, Luke Palmer wrote:
>
>> You must remember that the Perl 6 parser is one-pass now.
>
>
> It is? Are you sure?
It should be; the raw parsed data might be treated with regular
expressions in the parse-tree processing stage, but that shouldn't
count as a second pass.
Doesn't mean it will be. And "should" is an awfully strong word...
> At 10:16 PM -0500 12/9/02, Joseph F. Ryan wrote:
>
>> Dan Sugalski wrote:
>>
>>> At 5:11 PM -0700 12/9/02, Luke Palmer wrote:
>>>
>>>> You must remember that the Perl 6 parser is one-pass now.
>>>
>>>
>>>
>>> It is? Are you sure?
>>
>>
>>
>> It should be;
>
>
> Doesn't mean it will be. And "should" is an awfully strong word...
This is true; however, I don't see where anything would need more
than 1 pass, except to reduce complexity in some places. After all,
since the parser is to be constructed with regular expressions, then a
second pass would only be another use of regular expressions, which
means the "second pass" could have been included in the parser in the
first place. Of course, the parser isn't close to finished yet; as such,
it's anyone's guess as to what the final product will be. :)
Joseph F. Ryan
ryan...@osu.edu