How does one explain this to an audience that likely isn't convinced regexes
are all that important in the first place? "Sure it's line noise, but it's
new and improved line noise!" I may have to avoid the topic of regex
improvements unless I can cover it in < 5 minutes. Maybe a quick poll of
how many people are using one of the many Perl5-like regex libraries; if
a high proportion are, then talk about the new regex stuff.
Grammars, OTOH, are something I think I'll mention.
I also forgot hyperoperators. It's also likely worth mentioning that Perl's
method call syntax will switch to the dot, making it look more like other
languages.
Unicode from the ground up is probably also worth mentioning, though I'm not
quite sure what forms this will take other than "Unicode will not be an
awkward, bolt-on feature". I don't know how Java and Python handle Unicode.
--
This sig file temporarily out of order.
I don't know about Python.
--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--
You're never too old to become younger.
-- Mae West
I thought Java used UTF-16. It's a variable-width encoding, so it
should be fine. (Though I bet a lot of folks will be rather surprised
when it happens...)
--
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk
There were originally two separate universal character set efforts,
by the ISO and the Unicode Consortium. They decided early on to
combine their efforts and be mutually compatible.
However, ISO-10646 was designed as a 31-bit code, divided into
"planes" of 65,536 code points each, while Unicode was only 16 bits.
So Unicode is identical to plane 0 of ISO-10646, called the
Basic Multilingual Plane (BMP). So far, the ISO has no characters
defined outside of this plane.
It does plan to define some eventually, however (in ISO-10646-2), and
this is handled in Unicode through a section of the code space called
"surrogates", which are used in the UTF-16 encoding to reach planes
1-16 of ISO-10646.
ISO has no plans to define characters outside of planes 1-16 anytime
in the foreseeable future (or, indeed, outside of planes 1-14, since
15 and 16 are reserved for private use).
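
To make the "planes 1-16" limit concrete, here is a rough Java sketch of the arithmetic. The class and method names (Planes, planeOf, fromSurrogates) are made up for illustration and are not part of any library. It assumes the usual UTF-16 layout: high surrogates in U+D800 to U+DBFF, low surrogates in U+DC00 to U+DFFF, each contributing 10 bits.

```java
// Illustrative sketch only: shows why surrogate pairs reach exactly planes 1-16.
class Planes {
    // A plane holds 65,536 code points, so the plane number is the code point
    // with its low 16 bits shifted away.
    static int planeOf(int codePoint) {
        return codePoint >>> 16;
    }

    // Combine a high and a low surrogate into the code point they stand for.
    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        int lowest = fromSurrogates((char) 0xD800, (char) 0xDC00);
        int highest = fromSurrogates((char) 0xDBFF, (char) 0xDFFF);
        System.out.printf("U+%X is in plane %d%n", lowest, planeOf(lowest));
        System.out.printf("U+%X is in plane %d%n", highest, planeOf(highest));
    }
}
```

The smallest pair lands on U+10000 (plane 1) and the largest on U+10FFFF (plane 16), which is where the 1-16 range comes from.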
--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--
The end of the world will occur at three p.m., this Friday, with
symposium to follow.
My understanding was that Unicode has now escaped the base plane (or whatever
it's called) and has started using code points above 65535. How does Java
cope with this?
Nicholas Clark
Yeah, I see that's how the standard defines it, but... Looks like a
serious dodge to me. :)
Since Unicode 3.1 (3.2 is the current version), there have in fact
been defined characters outside the 16-bit range U+0000 to U+FFFF.
For instance, the block U+1D100 to U+1D1FF contains musical symbols.
Since Java 'char's are 16-bit quantities, characters outside of
the range U+0000 to U+FFFF have to be represented by pairs of
'char's from the 'surrogates' range, U+D800 through U+DFFF.
Java does not handle this conversion transparently; for instance,
the \uXXXX sequence to include a Unicode character code point
takes exactly four hexadecimal digits. So to represent, e.g.,
U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually
compose the surrogate pair (U+D834, U+DD07). This is a good thing
from the point of view of the Java programmer since it means a
'char' is always the same size, even though it may not represent the
entire desired character. In that, however, it is not fundamentally
different from composition within the 16-bit range - e.g. composing
'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã',
instead of using the single character U+00E3.
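
For anyone who wants to see the manual composition spelled out, here is a minimal Java sketch of how the pair for U+1D107 can be worked out by hand. SurrogatePair and toSurrogates are hypothetical names chosen for this example, not JDK APIs. The only assumption is the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```java
// Illustrative sketch only: compose a UTF-16 surrogate pair by hand.
class SurrogatePair {
    // Returns {high, low} for a code point in the range U+10000..U+10FFFF.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D107);
        // Prints: U+D834 U+DD07
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // The resulting string holds one character but two 16-bit chars.
        String sign = new String(pair);
        System.out.println(sign.length()); // prints 2
    }
}
```

Note the length of 2: the 'char' stays a fixed size, and it is up to the programmer to remember that the two of them belong together.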
Note that surrogates are bypassed when encoding in UTF-8;
you just transform the desired code point directly, resulting in a
UTF-8 sequence of four octets (characters through U+FFFF require a
maximum of three octets in UTF-8). Perl 5.6.1 already handles this
correctly for \x{...} values greater than 0xffff; e.g.
perl -e 'print "\x{1d107}\n";' will output the four-byte UTF-8 encoding
for that character.
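
As a cross-check on the "four octets, no surrogates" point, here is a small Java sketch that transforms the code point straight into UTF-8. Utf8Direct and encodeFourOctets are hypothetical names for this example, and the sketch only covers the four-octet case (U+10000 through U+10FFFF), not a general encoder.

```java
// Illustrative sketch only: encode a supplementary code point directly as UTF-8.
class Utf8Direct {
    // Returns the four UTF-8 octets for a code point in U+10000..U+10FFFF.
    // Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 payload bits).
    static int[] encodeFourOctets(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("needs a four-octet code point");
        }
        return new int[] {
            0xF0 | (codePoint >> 18),
            0x80 | ((codePoint >> 12) & 0x3F),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)
        };
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (int octet : encodeFourOctets(0x1D107)) {
            out.append(String.format("%02X ", octet));
        }
        System.out.println(out.toString().trim()); // prints F0 9D 84 87
    }
}
```

Those four octets, F0 9D 84 87, are the bytes the perl one-liner above should emit before the newline.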
--
Mark REED | CNN Internet Technology
1 CNN Center Rm SW0831G | mark...@cnn.com
Atlanta, GA 30348 USA | +1 404 827 4754
--
Going the speed of light is bad for your age.