Message from discussion Unicode 3.1, UTF-16, and Java [Re: Perl 6, The Good Parts Version]
Mark J. Reed  
Jul 31 2002, 4:48 pm
Newsgroups: perl.perl6.language
From: mark.r...@turner.com (Mark J. Reed)
Date: Wed, 31 Jul 2002 16:26:40 -0400
Local: Wed, Jul 31 2002 4:26 pm
Subject: Unicode 3.1, UTF-16, and Java [Re: Perl 6, The Good Parts Version]
> On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> > I thought Java used UTF-16. It's a variable-width encoding, so it
> > should be fine. (Though I bet a lot of folks will be rather surprised
> > when it happens...)

Update:

As of Unicode 3.1 (3.2 is the current version), characters have in
fact been defined outside the 16-bit range U+0000 to U+FFFF.
For instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside of
the range U+0000 to U+FFFF have to be represented by pairs of
characters from the 'surrogates' range, U+D800 through U+DFFF.
Java does not handle this conversion transparently; for instance,
the \uXXXX escape sequence for a Unicode code point
takes exactly four hexadecimal digits.  So to represent, e.g.,
U+1D107  MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually
compose the surrogate pair (U+D834, U+DD07).  This is a good thing
from the point of view of the Java programmer since it means a
'char' is always the same size, even though it may not represent the
entire desired character.  In that, however, it is not fundamentally
different from composition within the 16-bit range - e.g. composing
'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã',
instead of using the single character U+00E3.
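
To make the manual composition concrete, here is a minimal sketch of the
arithmetic behind the pair (U+D834, U+DD07). The class and method names
are made up for illustration and are not part of any Java API; the only
assumption is the standard UTF-16 rule of subtracting 0x10000 and
splitting the remaining 20 bits into two 10-bit halves.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class SurrogatePairSketch {

    // Split a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D107);
        // The same pair written by hand with two four-digit escapes.
        String viaEscapes = "\uD834\uDD07";

        System.out.println(Integer.toHexString(pair[0]) + " "
                + Integer.toHexString(pair[1]));              // d834 dd07
        System.out.println(viaEscapes.equals(new String(pair))); // true
        System.out.println(viaEscapes.length());                 // 2
    }
}
```

The last line of output is the point made above: the string holds one
character as far as the reader is concerned, but two 'char's as far as
Java is concerned.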

Note that surrogates are bypassed when encoding in UTF-8;
you just transform the desired code point directly, resulting in a
UTF-8 sequence of four octets (characters through U+FFFF require a
maximum of three octets in UTF-8).  Perl 5.6.1 already handles this
correctly for \x{...} values greater than 0xffff; e.g.  
perl -e 'print "\x{1d107}\n";' will output the four-byte UTF-8 encoding
for that character.
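
For comparison, a rough sketch of the direct transformation in Java: the
code point goes straight into the UTF-8 bit layout with no surrogates
involved. The names Utf8Sketch, encode, and hex are hypothetical, and
this is only an illustration of the bit layout, not how Perl or any
particular library implements it.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class Utf8Sketch {

    // Encode one code point directly as UTF-8 (one to four octets).
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {
            // Everything through U+FFFF fits in at most three octets.
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        // Code points above U+FFFF take four octets.
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    // Render octets as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x1D107))); // f0 9d 84 87
        System.out.println(encode(0xFFFF).length); // 3
    }
}
```

Those four octets should match what the perl one-liner above writes
before the newline.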

--
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | mark.r...@cnn.com
Atlanta, GA 30348      USA   | +1 404 827 4754
--
Going the speed of light is bad for your age.


 