I understand what the stream would look like in UTF-8 or an int[], but
what I am curious about is the cleanest way to create string literals
in a Java program containing such awkward characters.
--
Roedy Green Canadian Mind Products
http://mindprod.com
If you think it's expensive to hire a professional to do the job, wait until you hire an amateur.
~ Red Adair (born: 1915-06-18 died: 2004-08-07 at age: 89)
The Java class java.lang.String uses UTF-16. For supplementary
characters (i.e. those that require more than 16 bits), you use
surrogate pairs. But each character in a pair is just a regular 16-bit
character of data. So you specify them in a String literal just like
you'd specify any other 16-bit literal data.
I haven't done a lot of experimentation with Java and Unicode text
source files, but it's possible you can just enter the characters
normally, and the compiler will handle things for you. Otherwise, you
can always use the '\uXXXX' escape syntax to specify
the characters. You'd just need a pair of such escapes to specify a
single 32-bit character, using the appropriate surrogate pair rather
than the raw 32-bit character split in half.
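For instance, a quick illustrative sketch (the character choice is
arbitrary): U+1D11E, MUSICAL SYMBOL G CLEF, written as the surrogate
pair \uD834\uDD1E:

public class LiteralDemo {
    public static void main(String[] args) {
        // One supplementary character, written as two 16-bit escapes.
        String clef = "G clef: \uD834\uDD1E";
        System.out.println(clef.length());                          // 10 UTF-16 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 9 code points
    }
}

Character.toChars(0x1D11E) produces the same two chars if you would
rather not work out the pair by hand.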
See for more detail:
http://java.sun.com/javase/6/docs/api/java/lang/Character.html#unicode
http://java.sun.com/javase/6/docs/api/java/lang/String.html
Pete
Technically Unicode code points fit in the 0..0x10FFFF range, using
21 bits at most.
The JLS says this in section 3.1:
<< The Unicode standard was originally designed as a fixed-width 16-bit
character encoding. It has since been changed to allow for characters
whose representation requires more than 16 bits. The range of legal
code points is now U+0000 to U+10FFFF, using the hexadecimal U+n
notation. Characters whose code points are greater than U+FFFF are
called supplementary characters. To represent the complete range of
characters using only 16-bit units, the Unicode standard defines an
encoding called UTF-16. In this encoding, supplementary characters
are represented as pairs of 16-bit code units, the first from the
high-surrogates range, (U+D800 to U+DBFF), the second from the
low-surrogates range (U+DC00 to U+DFFF). For characters in the range
U+0000 to U+FFFF, the values of code points and UTF-16 code units are
the same.
The Java programming language represents text in sequences of 16-bit
code units, using the UTF-16 encoding. A few APIs, primarily in the
Character class, use 32-bit integers to represent code points as
individual entities. The Java platform provides methods to convert
between the two representations. >>
So basically, if you want to represent in the source code (e.g. in a
String literal) a code point beyond the first plane, then you use
a pair of \uxxxx sequences, for the two surrogates.
E.g., if you want to have a String literal with U+10C22 (that's
OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
then you first convert 0x10C22 to a surrogate pair:
1. subtract 0x10000: you get 0xC22
2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
(i.e. (u << 10) + l == 0xC22)
3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.
Therefore you get this:
public class Foo {
public static final String BAR = "an old Turkic letter: \uD803\uDC22";
}
Note that this is an ASCII-compatible representation of a Java source
file, which conceptually consists of a sequence of 16-bit code units.
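A quick way to double-check that arithmetic (an illustrative sketch,
not part of the original example) is Character.toChars, which has
performed the same split since Java 5:

public class SurrogateCheck {
    public static void main(String[] args) {
        // Split U+10C22 into its UTF-16 surrogate pair.
        char[] pair = Character.toChars(0x10C22);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints D803 DC22
    }
}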
Now, with javac from Sun's JDK 1.6.0_16, I can use a UTF-8 representation
of the source code. This allows me to use old Turkic letters directly.
For instance, the encoding of the source Foo.java could look like
this:
00000000 70 75 62 6c 69 63 20 63 6c 61 73 73 20 46 6f 6f |public class Foo|
00000010 20 7b 0a 09 70 75 62 6c 69 63 20 73 74 61 74 69 | {..public stati|
00000020 63 20 66 69 6e 61 6c 20 53 74 72 69 6e 67 20 42 |c final String B|
00000030 41 52 20 3d 20 22 61 6e 20 6f 6c 64 20 54 75 72 |AR = "an old Tur|
00000040 6b 69 63 20 6c 65 74 74 65 72 3a 20 f0 90 b0 a2 |kic letter: ....|
00000050 22 3b 0a 7d 0a |";.}.|
We see that in the source code, the "f0 90 b0 a2" UTF-8 sequence was
used. Javac accepts this, and it yields the same Foo.class as
previously. I still recommend using the two \uxxxx sequences detailed
above, for maximum portability (ASCII works everywhere and is resilient
to the various abuses suffered by text in emails or Usenet messages).
You may want to look at the resulting .class file. In the classfiles,
a "modified UTF-8" format is used for String literals, in which
surrogates are encoded separately. Thus, regardless of how I gave the
old Turkic letter to the Java compiler, the .class file will contain
the 6-byte sequence "ed a0 83 ed b0 a2" (UTF-8 encoding of U+D803,
then UTF-8 encoding of U+DC22).
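If you want to see that encoding without opening the .class file in a
hex editor, here is an illustrative sketch (mine, not part of the
original post): DataOutputStream.writeUTF uses the same modified UTF-8
as the class file constant pool, preceded by a two-byte length.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        // Write the old Turkic letter as its surrogate pair.
        new DataOutputStream(bos).writeUTF("\uD803\uDC22");
        for (byte b : bos.toByteArray()) {
            System.out.printf("%02x ", b);  // prints: 00 06 ed a0 83 ed b0 a2
        }
        System.out.println();
    }
}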
--Thomas Pornin
>E.g., if you want to have a String literal with U+10C22 (that's
>OLD TURKIC LETTER ORKHON EM; it somewhat looks like a fish),
>then you first convert 0x10C22 to a surrogate pair:
> 1. subtract 0x10000: you get 0xC22
> 2. get the upper (u) and lower (l) 10 bits; you get u=0x3 and l=0x022
> (i.e. (u << 10) + l == 0xC22)
> 3. the high surrogate is 0xD800 + u, the low surrogate is 0xDC00 + l.
That is what I was afraid of. I am doing that now to generate tables
of char entities and the equivalent hex and \u entities on various
pages of mindprod.com, e.g. http://mindprod.com/jgloss/html5.html
which shows the new HTML entities in HTML 5.
here is my code:
// split the supplementary code point into its UTF-16 surrogate pair
final int extract = theCharNumber - 0x10000;
final int high = ( extract >>> 10 ) + 0xd800;
final int low = ( extract & 0x3ff ) + 0xdc00;
// emit a quoted literal of the form "\uXXXX\uXXXX"
sb.append( "\"\\u" );
sb.append( StringTools.toLzHexString( high, 4 ) );
sb.append( "\\u" );
sb.append( StringTools.toLzHexString( low, 4 ) );
sb.append( "\"" );
I started to think about what would be needed to make this less
onerous.
1. an applet to convert hex to a surrogate pair.
2. allow \u12345 in string literals. However, that would break
existing code: \u12345 currently means
"\u1234" + "5" (see the sketch after this list).
3. So you have to pick another letter, e.g. \c12345; for "code point". It
needs a terminator, so that in future it could also handle \c123456;
I don't know what that might break.
4. Introduce 32-bit CodePoint string literals with extensible \u
mechanism. E.g. CString b = c"\u12345;Hello";
5. specify weird chars with named entities to make the code more
readable. Entities in String literals would be translated to binary
at compile time, so the entities would not exist at run-time. The
HTML 5 set would be greatly extended to give pretty well every Unicode
glyph a name.
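To make point 2 concrete, a small illustrative sketch (my own, class
name is arbitrary) of how the compiler treats "\u12345" today:

public class EscapeDemo {
    public static void main(String[] args) {
        // The compiler translates the escape to the single char U+1234,
        // and the trailing '5' becomes a second, separate char.
        String s = "\u12345";
        System.out.println(s.length());                 // prints 2
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));  // prints 1234 0035
    }
}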
P.S. I have been poking around in HTML 5. W3C did an odd thing. They
REDEFINED the entities &lang; and &rang; to map to different glyphs than
in HTML 4. I don't think they have ever done anything like that before. I
hope it was just an error. I have written the W3C asking if they
really meant to do that.
>I started to think about what would be needed to make this less
>onerous.
If you had only a few, you could create a library of named constants for
them, and glue them together with compile time concatenation. With
only a little cleverness, a compiler would avoid embedding constants
it did not use.
Is any OS, JVM, utility, browser etc. capable of rendering a code
point above 0xffff? I get the impression all we can do is embed them
in UTF-8 files.
IIRC, C99 introduced \uXXXX and \UXXXXXXXX.
--
ss at comp dot lancs dot ac dot uk
Oh yes, plenty.
Well, at least on my system (Linux with Ubuntu 9.10). For instance,
if I write this HTML file:
<html>
<body>
<p>🂓</p>
</body>
</html>
then both Firefox and Chromium display the "DOMINO TILE VERTICAL-06-06"
as they should. Now if I write this Java code:
public class Foo {
    public static void main(String[] args)
    {
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(0x1F093);
        System.out.println(sb.toString());
    }
}
and run it in a standard terminal (GNOME Terminal 2.28.1 on that
system), then the domino tile is displayed. If I redirect the output in
a file, I can edit it just fine with the vim text editor; the domino
tile is being handled as a single character, just like it is supposed to
be.
Internally, C programs which wish to handle the full Unicode on Linux
use the 'wide character' type (wchar_t) which, on Linux, is defined to
be a 32-bit integer. Therefore there is nothing special about the 0xFFFF
limit. In practice, Unicode display trouble usually stems from the limited
availability of fonts with exotic characters (although Linux has a fair
share of such fonts), double-width characters in monospace fonts, and
right-to-left scripts, all of which are orthogonal to the 16/32-bit
issue.
The same is not true in Windows, which switched to Unicode earlier, when
code points were 16-bit only; on Windows, wchar_t and the "wide string
literals" use 16-bit characters, and recent versions of Windows have to
resort to UTF-16 to process higher planes, just like Java. I have been
told that the OS is plainly able to process and display all of the
Unicode planes, but it can be expected that some applications are not up
to it yet.
C# is a late-comer (2001) but uses a 16-bit char type. This may be an
artefact of Java imitation. This may also be an attempt to ease
conversion of C or C++ code for Windows into C# code.
--Thomas Pornin
I have trouble understanding why the surrogate code points are counted
twice: once as code points in their own right, and then again as part of
the code points that are reached by an adjacent pair of them.
In my understanding, that would effectively leave code points up to
0x10F7FF once the surrogate block is removed from the count, as the
surrogates wouldn't be legal as single code points, but only as pairs.
But then again, perhaps my own understanding of "legal code points" just
differs from some common definition.
It makes defining UTF-16 easy and less error-prone.
Yet I guess the range of legal code points is still U+0000 to
U+10FFFF, excluding the surrogate range in the middle.
--
Mayeul
> Thomas Pornin <por...@bolet.org> quoted the JLS section 3.1:
>> << The Unicode standard was originally designed as a fixed-width 16-bit
>> character encoding. It has since been changed to allow for characters
>> whose representation requires more than 16 bits. The range of legal
>> code points is now U+0000 to U+10FFFF
>
> I have trouble understanding why the surrogate code points are counted
> twice: once as code points in their own right, and then again as part of
> the code points that are reached by an adjacent pair of them.
The range is a bound - all legal code points are inside it. It doesn't
mean that all numbers inside it are legal code points. There are plenty of
numbers which aren't mapped to any character, and so aren't legal code
points - the surrogates are just a special case of those. I reckon.
tom
--
X is for ... EXECUTION!
Thanks, that was my mistake: I somehow took "range" as implying
"all in the range" - and a code point with no char mapped to it wasn't
necessarily illegal in my mind, but a single surrogate was.
Not all values from 0 to 0x10FFFF are usable as characters by themselves.
For instance, 0xFFFE and 0xFFFF are permanently defined as noncharacters,
not only now but also for all future versions of Unicode
(this keeps BOM detection unambiguous).
Surrogates are not legal "alone", but it is quite handy that old Unicode
systems (those which only know of 16-bit code units, such as Java pre-5)
will accept surrogates like any other non-special code point: thus,
surrogate pairs can be smuggled into a 16-bit-only system, and that's
called UTF-16. This is somewhat equivalent to pushing UTF-8 data into an
application which expects ASCII: all ASCII characters keep the same
encoding, and we just hope that the application will store the
bytes in the 0x80..0xF7 range unmolested.
--Thomas Pornin
>
>IIRC, C99 introduced \uXXXX and \UXXXXXXXX.
It would make sense to follow suit. Life is complicated enough already
for people who code in more than one language each day.
> On Tue, 22 Dec 2009 18:01:17 -0800, Roedy Green
> <see_w...@mindprod.com.invalid> wrote, quoted or indirectly quoted
> someone who said :
>
>> I started to think about what would be needed to make this less
>> onerous.
>
> If you had only a few, you could create a library of named constants for
> them, and glue them together with compile time concatenation. With
> only a little cleverness, a compiler would avoid embedding constants
> it did not use.
>
>
> Is any OS, JVM, utility, browser etc. capable of rendering a code
> point above 0xffff? I get the impression all we can do is embed them
> in UTF-8 files.
OS X comes with fonts that contain glyphs for some (but not all)
characters above U+FFFF out of the box, and can render them anywhere
they appear. Their visibility in Swing apps depends heavily on the L&F;
if you don't force it, Java will default to the Aqua L&F and render
most things correctly.
Webapps, obviously, render nothing; they send encoded characters to
other things, which may render them. Safari, Chrome, and Firefox can
all render U+1D360 (COUNTING ROD UNIT DIGIT ONE).
In the interests of science, what characters do you see on the next line?
𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡
This message is encoded as UTF-8, and those should be, in order,
Codepoint (UTF-8 representation) NAME
U+10100 (F0 90 84 80) AEGEAN WORD SEPARATOR LINE
U+10140 (F0 90 85 80) GREEK ACROPHONIC ATTIC ONE QUARTER
U+10190 (F0 90 86 90) ROMAN SEXTANS SIGN
U+10300 (F0 90 8C 80) OLD ITALIC LETTER A
U+10400 (F0 90 90 80) DESERET CAPITAL LETTER LONG I
U+10450 (F0 90 91 90) SHAVIAN LETTER PEEP
U+1D121 (F0 9D 84 A1) MUSICAL SYMBOL C CLEF
with spaces between.
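If you want to reproduce that table yourself, here is an illustrative
sketch (mine; it prints the character itself rather than its Unicode
name, since Character.getName only arrived in Java 7):

import java.nio.charset.Charset;

public class Utf8Table {
    public static void main(String[] args) {
        int[] codePoints = { 0x10100, 0x10140, 0x10190, 0x10300, 0x10400, 0x10450, 0x1D121 };
        Charset utf8 = Charset.forName("UTF-8");
        for (int cp : codePoints) {
            // Convert the code point to its UTF-16 form, then to UTF-8 bytes.
            String s = new String(Character.toChars(cp));
            StringBuilder hex = new StringBuilder();
            for (byte b : s.getBytes(utf8)) {
                hex.append(String.format("%02X ", b));
            }
            System.out.printf("U+%05X (%s) %s%n", cp, hex.toString().trim(), s);
        }
    }
}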
Cheers,
-o
Debian Lenny / browser Iceweasel 3.0.6 (Firefox re-branded for true
freedom ;)
I see boxes with tiny hex codes in them, not corresponding to the
characters.
But then I can select them and paste them into an xterm, where I see all
'? ? ? ? ?'
thingies; yet the file I pasted them into in the terminal (using cat >
aa.txt)
contains the correct characters, as shown by a hexdump:
$ hexdump aa.txt
0000000 90f0 8084 f020 8590 2080 90f0 9086 f020
0000010 8c90 2080 90f0 8090 f020 9190 2090 9df0
0000020 a184 000a
:)
Here we've got a mix of Windows, Linux and OS X
devs, so we're using scripts called at (Ant) build time that
enforce that all .java files:
a) use a subset of ASCII in their names
b) contain only ASCII characters
You can't build an app with non-ASCII characters in our
.java files, and you certainly can't commit them :)
It's in the guidelines.
Better safe than sorry :)
> In the interests of science, what characters do you see on the next line?
>
> ? ? ? ? ? ? ?
Seven question marks.
Using Alpine 1.10 on Debian 5.0.3 accessed over OpenSSH 5.1p1 from iTerm
0.10 on OS X 10.4.11. Plus a few more layers i've forgotten, probably.
Easily enough for one of them to drop the unicode ball somewhere!
tom
--
science fiction, old TV shows, sports, food, New York City topography,
and golden age hiphop
> In the interests of science, what characters do you see on the next line?
>
> 𐄀 𐅀 𐆐 𐌀 𐐀 𐑐 𝄡
6 question marks and a [1/4].
I bet this has more to do with the news server we're each using than our
client's OS or newsreader. Vista/Thunderbird here.
Probably. I'm using Thunderbird on the Mac, and I see actual characters.
You can look at your message source to see what the article text
actually looks like as retrieved from your NNTP server. If the server
has left the raw data as posted by Owen alone, and if your Thunderbird
configuration is set correctly, I would expect you to see the actual
characters, just like I do.
Pete
>
>In the interests of science, what characters do you see on the next line?
>
>? ? ? ? ? ? ?
Using Agent with Windows 7 64 bit I just see ? marks.
--
Roedy Green Canadian Mind Products
http://mindprod.com
If you give someone a program, you will frustrate them for a day; if you teach them how to program, you will frustrate them for a lifetime.