Perl 6, The Good Parts Version

Michael G Schwern

Jul 16, 2002, 5:42:18 PM

to Tim....@pobox.com, perl6-l...@perl.org

On Wed, Jul 03, 2002 at 10:52:58PM +0100, Tim Bunce wrote:
> Don't forget Apocalypse 5.
>
> Personally I believe the elegant and thorough integration of regular
> expressions and backtracking into the large-scale logic of an
> application is one of the most radical things about Perl 6.

How does one explain this to an audience that likely isn't convinced regexes
are all that important in the first place? "Sure it's line noise, but it's
new and improved line noise!" I may have to avoid the topic of regex
improvements unless I can cover it in < 5 minutes. Maybe take a quick poll of
how many people are using one of the many Perl5-like regex libraries; if
there's a high proportion, then talk about the new regex stuff.

Grammars, OTOH, are something I think I'll mention.

I also forgot hyperoperators. It's also likely worth mentioning that Perl's
method call syntax will switch to the dot, making it look more like other
languages.

Unicode from the ground up is probably also worth mentioning, though I'm not
quite sure what forms this will take other than "Unicode will not be an
awkward, bolt-on feature". I don't know how Java and Python handle Unicode.

--
This sig file temporarily out of order.

Mark J. Reed

Jul 17, 2002, 12:32:43 AM

to perl6-l...@perl.org

On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
> I don't know how Java and Python handle Unicode.

Java has always been 100% Unicode from the ground up; it's in the spec.
The fundamental char type is a 16-bit value, you can use any "letterlike"
characters in identifiers, there's an escape sequence (\uXXXX)
to include untypable characters in strings, etc. I/O defaults to the
platform's native encoding, but you can arrange for others (UTF-8, for instance).

I don't know about Python.

--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--
You're never too old to become younger.
-- Mae West

Dan Sugalski

Jul 17, 2002, 12:13:47 PM

to Nicholas Clark, Mark J. Reed, perl6-l...@perl.org

At 4:17 PM +0100 7/17/02, Nicholas Clark wrote:

>On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:
>> On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
>> > I don't know how Java and Python handle Unicode.
>> Java has always been 100% Unicode from the ground up; it's in the spec.
>> The fundamental char type is a 16-bit value, you can use any "letterlike"
>
>My understanding was that Unicode has now escaped the base plane (or whatever
>it's called) and now has started using code points >65536. How does Java
>cope with this?

I thought Java used UTF-16. It's a variable-width encoding, so it
should be fine. (Though I bet a lot of folks will be rather surprised
when it happens...)
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Mark J. Reed

Jul 17, 2002, 12:34:13 PM

to perl6-l...@perl.org

On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> I thought Java used UTF-16. It's a variable-width encoding, so it
> should be fine. (Though I bet a lot of folks will be rather surprised
> when it happens...)

UTF-16 isn't technically a variable-width encoding, since
surrogate codes are still considered single characters - even
though they only have meaning when combined in pairs. It's much
the same as multiple combining characters coming together to represent
a single abstract entity that is also not really a "character"; the
chief difference is that surrogates don't mean anything at all on their own.
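
The distinction between code units, code points, and what a reader sees as
one character can be checked with a small sketch. It is written in Python
rather than Java purely for brevity; the helper name utf16_units is made up
for illustration and is not part of any library.

```python
# Illustrative helper (hypothetical name): count 16-bit UTF-16 code units.
def utf16_units(text):
    # Encode without a byte-order mark; every code unit occupies two bytes.
    return len(text.encode("utf-16-be")) // 2

bmp_char = "\u00e3"            # a-with-tilde as one precomposed BMP character
combined = "a\u0303"           # 'a' followed by COMBINING TILDE
supplementary = "\U0001D107"   # a code point outside the BMP

units = [utf16_units(s) for s in (bmp_char, combined, supplementary)]

# A surrogate on its own carries no meaning, so a strict encoder rejects it.
try:
    "\ud834".encode("utf-16-be")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False
```

Both the combining sequence and the supplementary character occupy two
16-bit units, but only the combining sequence is built from units that are
each meaningful alone.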

--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--

There are no rules for March. March is spring, sort of, usually. March
means maybe, but don't bet on it.

Mark J. Reed

Jul 17, 2002, 12:24:27 PM

to Nicholas Clark, perl6-l...@perl.org

On Wed, Jul 17, 2002 at 04:17:15PM +0100, Nicholas Clark wrote:
> My understanding was that Unicode has now escaped the base plane (or whatever
> it's called) and now has started using code points >65536. How does Java
> cope with this?

This is getting a little off-topic, I think. But here's a brief overview
of the Unicode codespace size issue - if you have any more questions,
you can ask me off-list.

There were originally two separate universal character set efforts,
by the ISO and the Unicode Consortium. They decided early on to
combine their efforts and be mutually compatible.

However, ISO-10646 was designed as a 32-bit code, consisting
of 65,536 16-bit "planes", while Unicode was only 16 bits.
So Unicode is identical to plane 0 of ISO-10646, called the
Basic Multilingual Plane (BMP). So far, the ISO has no characters
defined outside of this plane.

It does plan to define some eventually, however (in ISO-10646-2), and
this is handled in Unicode through a section of the code space called
"surrogates", which are used in the UTF-16 encoding to reach planes
1-16 of ISO-10646.

ISO has no plans to define characters outside of planes 1-16 anytime
in the foreseeable future (or, indeed, outside of planes 1-14, since
15 and 16 are reserved for private use).
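
The arithmetic behind "planes 1-16" can be sketched as follows. This is an
illustration in Python under the assumption that the surrogate area is
split into 1,024 leading and 1,024 trailing values (U+D800-U+DBFF and
U+DC00-U+DFFF); the function name plane is made up for the example.

```python
PLANE_SIZE = 0x10000          # 65,536 code points per plane

def plane(code_point):
    # The plane number is whatever lies above the low 16 bits.
    return code_point >> 16

high_surrogates = range(0xD800, 0xDC00)   # 1,024 leading values
low_surrogates = range(0xDC00, 0xE000)    # 1,024 trailing values

# Every (high, low) combination addresses one code point beyond the BMP.
reachable = len(high_surrogates) * len(low_surrogates)
planes_reachable = reachable // PLANE_SIZE
last_code_point = 0xFFFF + reachable
```

Under that assumption the pairs cover exactly 16 planes beyond the BMP, so
the highest code point UTF-16 can reach is U+10FFFF, in plane 16.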

--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--

The end of the world will occur at three p.m., this Friday, with
symposium to follow.

Nicholas Clark

Jul 17, 2002, 11:17:15 AM

to Mark J. Reed, perl6-l...@perl.org

On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:

> On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
> > I don't know how Java and Python handle Unicode.
> Java has always been 100% Unicode from the ground up; it's in the spec.
> The fundamental char type is a 16-bit value, you can use any "letterlike"

My understanding was that Unicode has now escaped the base plane (or whatever
it's called) and now has started using code points >65536. How does Java
cope with this?

Nicholas Clark

Dan Sugalski

Jul 17, 2002, 12:59:22 PM

to perl6-l...@perl.org

At 12:34 PM -0400 7/17/02, Mark J. Reed wrote:
>On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
>> I thought Java used UTF-16. It's a variable-width encoding, so it
>> should be fine. (Though I bet a lot of folks will be rather surprised
>> when it happens...)
>UTF-16 isn't technically a variable-width encoding, since
>surrogate codes are still considered single characters - even
>though they only have meaning when combined in pairs. It's much
>the same as multiple combining characters coming together to represent
>a single abstract entity that is also not really a "character"; the
>chief difference is that surrogates don't mean anything at all on their own.

Yeah, I see that's how the standard defines it, but... Looks like a
serious dodge to me. :)

Mark J. Reed

Jul 31, 2002, 4:26:40 PM

to perl6-l...@perl.org

> On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> > I thought Java used UTF-16. It's a variable-width encoding, so it
> > should be fine. (Though I bet a lot of folks will be rather surprised
> > when it happens...)

Update:

Since Unicode 3.1 (3.2 is the current version), there have in fact
been defined characters outside the 16-bit range U+0000 to U+FFFF.
For instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside of
the range U+0000 to U+FFFF have to be represented by pairs of
characters from the 'surrogates' range, U+D800 through U+DFFF.
Java does not handle this conversion transparently; for instance,
the \uXXXX sequence to include a Unicode character code point
takes exactly four hexadecimal digits. So to represent, e.g.,
U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually
compose the surrogate pair (U+D834, U+DD07). This is a good thing
from the point of view of the Java programmer since it means a
'char' is always the same size, even though it may not represent the
entire desired character. In that, however, it is not fundamentally
different from composition within the 16-bit range - e.g. composing
'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã',
instead of using the single character U+00E3.
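
Composing the pair by hand follows the standard UTF-16 arithmetic. Below is
a minimal sketch in Python (not Java); to_surrogate_pair is a hypothetical
helper name, and the last lines use the standard unicodedata module to show
the combining-character comparison.

```python
import unicodedata

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

pair = to_surrogate_pair(0x1D107)

# Spell the pair as two four-digit \uXXXX escapes, as Java source needs.
java_escape = "".join(f"\\u{unit:04X}" for unit in pair)

# Composition inside the BMP: 'a' + COMBINING TILDE normalizes to U+00E3.
composed = unicodedata.normalize("NFC", "a\u0303")
```

For U+1D107 this yields the pair (U+D834, U+DD07) given above, which would
be written in Java source as the two escapes \uD834\uDD07.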

Note that surrogates are bypassed when encoding in UTF-8;
you just transform the desired code point directly, resulting in a
UTF-8 sequence of four octets (characters through U+FFFF require a
maximum of three octets in UTF-8). Perl 5.6.1 already handles this
correctly for \x{...} values greater than 0xffff; e.g.
perl -e 'print "\x{1d107}\n";' will output the four-byte UTF-8 encoding
for that character.
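
The direct transformation can be sketched as below. This is an illustrative
Python encoder, not what Perl does internally; utf8_encode is a made-up
name, and the sketch assumes a valid code point (it does not reject
surrogate values or anything above U+10FFFF).

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 octets, with no surrogate step."""
    if code_point < 0x80:                  # 1 octet: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                 # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:               # 3 octets, up to U+FFFF
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),           # 4 octets
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

repeat_sign = utf8_encode(0x1D107)         # the example character above
bmp_limit = utf8_encode(0xFFFF)            # last code point needing 3 octets
```

The result for U+1D107 is the four octets F0 9D 84 87, which agrees with
Python's built-in chr(0x1D107).encode("utf-8").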

--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--

Going the speed of light is bad for your age.