languages with full unicode support

Xah Lee

Jun 25, 2006, 12:08:56 PM
Languages with Full Unicode Support

As far as I know, Java and JavaScript are languages with full, complete
Unicode support. That is, they allow names to be defined using Unicode.
(The JavaScript engine used by Firefox supports this.)

As far as I know, here is the status of a few other languages:

C → No.
Python → No.
Perl → No.
Haskell → Yes by the spec, but no on existing compilers.
JavaScript → No in general. Firefox's engine does support it.
Lisps → No.
unix shells (bash) → No. (this probably applies to all unix shells)
Java → Yes and probably beats all. However, there may be a bug in 1.5
compiler.

Also, there appears to be a bug with Java 1.5's unicode support. The
following code compiles fine in 1.4, but under 1.5 the compiler
complains about the name x1.str★.

class 方 {
    String str北 =
        "北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。\n寧不知倾城与倾国。\n佳人難再得。";
    String str★ = "θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈ "
        + "ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏ "
        + "⊕⊗⊙⊚⊛∘∙ ★☆";
}

class UnicodeTest {
    public static void main(String[] arg) {
        方 x1 = new 方();
        System.out.println( x1.str北 );
        System.out.println( x1.str★ );
    }
}

If you know a lang that does full unicode support, please let me know.
Thanks.

Xah
x...@xahlee.org
http://xahlee.org/

Frank Buss

Jun 25, 2006, 12:30:22 PM
Xah Lee wrote:

> Lisps → No.

The Common Lisp spec (CLHS) doesn't require that implementations support
Unicode characters, but it doesn't forbid it and some implementations
support it, e.g. http://clisp.cons.org/impnotes.html

--
Frank Buss, f...@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Mumia W.

Jun 25, 2006, 3:27:35 PM
Xah Lee wrote:
> Languages with Full Unicode Support
>
> As far as i know, Java and JavaScript are languages with full, complete
> unicode support. That is, they allow names to be defined using unicode.
> (the JavaScript engine used by FireFox support this)
>
> As far as i know, here's few other lang's status:
>
> C → No.
> Python → No.
> Perl → No.

Perl supports Unicode in its core, and that includes identifier names
using exotic characters.


> Haskell → Yes by the spec, but no on existing compilers.

Erm, isn't this an effective "No"?

> JavaScript → No in general. Firefox's engine do support it.
> Lisps → No.
> unix shells (bash) → No. (this probably applies to all unix shells)
> Java → Yes and probably beats all. However, there may be a bug in 1.5
> compiler.
>
> Also, there appears to be a bug with Java 1.5's unicode support. The
> following code compiles fine in 1.4, but under 1.5 the compiler
> complains about the name x1.str★.
>
> class 方 {
> String str北 =
> "北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。\n寧不知倾城与倾国。\n佳人難再得。";
> String str★="θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈
> ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏
> ⊕⊗⊙⊚⊛∘∙ ★☆";
>
> }
>
> class UnicodeTest {
> public static void main(String[] arg) {
> 方 x1 = new 方();
> System.out.println( x1.str北 );
> System.out.println( x1.str★ );
> }
> }
>
> If you know a lang that does full unicode support, please let me know.
> Thanks.
>
> Xah
> x...@xahlee.org
> ∑ http://xahlee.org/

Perl is coming close to having full unicode support. '★' is not an
alphabetic or numeric character and has no place in an identifier. That
is why both Perl and Java reject it. Let's see what Perl can do:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

package 方;
our $str北 = "北方有佳人,絕世而獨立。\n一顧傾人城,再顧傾人国。"
    . "\n寧不知倾城与倾国。\n佳人難再得。";

our $strβ = "θπαβγλϕρκψ ≤≥≠≈⊂⊃⊆⊇∈
ⅇⅈⅉ∞∆° ℵℜℂℝℚℙℤ ℓ∟∠∡ ∀∃ ∫∑∏
⊕⊗⊙⊚⊛∘∙ ★☆";

# Constructor: the object holds references to the two package variables.
sub new {
    my $class = shift;
    my $self = {
        str北  => \$str北,
        'strβ' => \$strβ,
    };
    bless ($self, $class);
}

sub str北 {
    ${ (shift)->{str北} };
}

sub strβ {
    ${ (shift)->{strβ} };
}

package Test方;

sub do {
    # The layer name needs its leading colon.
    binmode STDOUT, ':utf8';
    my $obj方 = 方->new();
    $\ = "\n";
    print $obj方->str北();
    print '----------------';
    print $obj方->strβ();
}

Test方->do();

Darren New

Jun 25, 2006, 4:22:46 PM
Xah Lee wrote:
> If you know a lang that does full unicode support, please let me know.

Tcl. You may have to modify the "source" command to get it to default
to something other than the system encoding, but this is trivial in Tcl.

--
Darren New / San Diego, CA, USA (PST)
Native Americans used every part
of the buffalo, including the wings.

Oliver Bandel

Jun 25, 2006, 6:54:04 PM

こんいちわ Xah-Lee san ;-)


Xah Lee wrote:

> Languages with Full Unicode Support
>
> As far as i know, Java and JavaScript are languages with full, complete
> unicode support. That is, they allow names to be defined using unicode.

Can you explain what you mean by the names here?


> (the JavaScript engine used by FireFox support this)
>
> As far as i know, here's few other lang's status:
>
> C → No.

Well, is this (only) a language issue?

On Plan 9 everything seems to be UTF-8 based,
and when you use C for programming there, I would think
that C can handle this as well.

But I have only read some papers about Plan 9 and have not developed on
it....

This is only an attempt to look at it from a different angle.

If someone knows more, please let us know :)


Ciao,
Oliver

Xah Lee

Jun 25, 2006, 10:56:46 PM
Mumia W. wrote:
[example of perl supporting unicode in var names]

oh shit, i was surprised but actually i knew it all along, just forgot
about it. :)
see
http://xahlee.org/perl-python/unicode.html

...

«use bytes; # Larry can take Unicode and shove it up his ass sideways.

# Perl 5.8.0 causes us to start getting incomprehensible
# errors about UTF-8 all over the place without this.»
—from the source code of WebCollage (1998),
by Jamie W Zawinski (~1971-)

What's Jamie yabbing about there?

Xah
x...@xahlee.org
http://xahlee.org/

OMouse

Jun 25, 2006, 11:56:46 PM

> As far as i know, here's few other lang's status:
>
> C → No.

I think C has the wchar_t type to handle larger values. And C++ has
std::wstring. So really, the support is there.
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c

I think the problem is that most C/C++ coders don't care about unicode
support and so they stick to char and std::string.

Oliver Wong

Jun 26, 2006, 10:29:53 AM

"Oliver Bandel" <oli...@first.in-berlin.de> wrote in message
news:11512760...@elch.in-berlin.de...

>
> Xah Lee wrote:
>
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
>
> Can you explain what you mean by the names here?

As in variable names, function names, class names, etc.

- Oliver

Tin Gherdanarra

Jun 27, 2006, 12:09:54 PM
Oliver Bandel wrote:
>
> こんいちわ Xah-Lee san ;-)

Uhm, I'd guess that Xah is Chinese. Be careful
with such things in real life; Koreans might
beat you up for this. Stay alive!


--
Lisp kann nicht kratzen, denn Lisp ist fluessig

Matthias Blume

Jun 27, 2006, 4:54:02 PM
Tin Gherdanarra <tinma...@gmail.com> writes:

> Oliver Bandel wrote:
>> こんいちわ Xah-Lee san ;-)
>
> Uhm, I'd guess that Xah is Chinese. Be careful
> with such things in real life; Koreans might
> beat you up for this. Stay alive!

And the Japanese might beat him up, too. For butchering their
language. :-)

Tim Roberts

Jun 28, 2006, 3:24:48 AM
"Xah Lee" <x...@xahlee.org> wrote:

>Languages with Full Unicode Support
>
>As far as i know, Java and JavaScript are languages with full, complete
>unicode support. That is, they allow names to be defined using unicode.
>(the JavaScript engine used by FireFox support this)
>
>As far as i know, here's few other lang's status:
>

>C → No.

This is implementation-defined in C. A compiler is allowed to accept
variable names with alphabetic Unicode characters outside of ASCII.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Joachim Durchholz

Jun 28, 2006, 5:39:16 AM
Tim Roberts schrieb:

> "Xah Lee" <x...@xahlee.org> wrote:
>> C → No.
>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

Hmm... that code would be nonportable, so C support for Unicode is
half-baked at best.

Regards,
Jo

David Hopwood

Jun 28, 2006, 7:03:05 AM
Tim Roberts wrote:
> "Xah Lee" <x...@xahlee.org> wrote:
>
>>Languages with Full Unicode Support
>>
>>As far as i know, Java and JavaScript are languages with full, complete
>>unicode support. That is, they allow names to be defined using unicode.
>>(the JavaScript engine used by FireFox support this)
>>
>>As far as i know, here's few other lang's status:
>>
>>C → No.
>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

It is not implementation-defined in C99 whether Unicode characters are
accepted; only how they are encoded directly in the source multibyte character
set.

Characters escaped using \uHHHH or \U00HHHHHH (H is a hex digit), and that
are in the sets of characters defined by Unicode for identifiers, are required
to be supported, and should be mangled in some consistent way by a platform's
linker. There are Unicode text editors which encode/decode \u and \U on the fly,
so you can treat this essentially like a Unicode transformation format (it
would have been nicer to require support for UTF-8, but never mind).
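
As a rough illustration of the on-the-fly encoding mentioned above, here is a minimal Java sketch that rewrites the non-ASCII characters of an identifier as C99 universal character names. The class name UcnEscaper and the method toUcn are made up for this example and do not come from any editor or compiler discussed in this thread. The sketch also assumes the input is already a legal identifier; it does not check the C99 annex D ranges.

```java
// Hypothetical helper for illustration only; not taken from any real tool.
final class UcnEscaper {

    /**
     * Rewrites every code point at or above U+00A0 as a C99 universal
     * character name: backslash-u plus 4 hex digits, or backslash-U plus
     * 8 hex digits for code points beyond the Basic Multilingual Plane.
     */
    static String toUcn(String identifier) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < identifier.length(); ) {
            int cp = identifier.codePointAt(i);
            if (cp < 0xA0) {
                // Assumption: leave everything below U+00A0 untouched, since
                // C99 does not accept universal character names there
                // (apart from a few exceptions this sketch ignores).
                out.appendCodePoint(cp);
            } else if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));
            } else {
                out.append(String.format("\\U%08X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "str" followed by the CJK character U+5317
        System.out.println(toUcn("str\u5317"));
    }
}
```

An editor doing the reverse direction would scan for these escapes and display the decoded character instead.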


C99 6.4.2.1:

# 3 Each universal character name in an identifier shall designate a character
# whose encoding in ISO/IEC 10646 falls into one of the ranges specified in
# annex D. 59) The initial character shall not be a universal character name
# designating a digit. An implementation may allow multibyte characters that
# are not part of the basic source character set to appear in identifiers;
# which characters and their correspondence to universal character names is
# implementation-defined.
#
# 59) On systems in which linkers cannot accept extended characters, an encoding
# of the universal character name may be used in forming valid external
# identifiers. For example, some otherwise unused character or sequence of
# characters may be used to encode the \u in a universal character name.
# Extended characters may produce a long external identifier.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Chris Uppal

Jun 28, 2006, 7:38:04 AM
Joachim Durchholz wrote:

> > This is implementation-defined in C. A compiler is allowed to accept
> > variable names with alphabetic Unicode characters outside of ASCII.
>
> Hmm... that code would be nonportable, so C support for Unicode is
> half-baked at best.

Since the interpretation of characters which are yet to be added to
Unicode is undefined (will they be digits, "letters", operators, symbol,
punctuation.... ?), there doesn't seem to be any sane way that a language could
allow an unrestricted choice of Unicode in identifiers. Hence, it must define
a specific allowed sub-set. C certainly defines an allowed subset of Unicode
characters -- so I don't think you could call its Unicode support "half-baked"
(not in that respect, anyway). A case -- not entirely convincing, IMO -- could
be made that it would be better to allow a wider range of characters.

And no, I don't think Java's approach -- where there /is no defined set of
allowed identifier characters/ -- makes any sense at all :-(

-- chris


David Hopwood

Jun 28, 2006, 9:56:18 AM
Note Followup-To: comp.lang.java.programmer

Chris Uppal wrote:
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers. Hence, it must define
> a specific allowed sub-set. C certainly defines an allowed subset of Unicode
> characters -- so I don't think you could call its Unicode support "half-baked"
> (not in that respect, anyway). A case -- not entirely convincing, IMO -- could
> be made that it would be better to allow a wider range of characters.
>
> And no, I don't think Java's approach -- where there /is no defined set of
> allowed identifier characters/ -- makes any sense at all :-(

Java does have a defined set of allowed identifier characters. However, you
certainly have to go around the houses a bit to work out what that set is:


<http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

# An identifier is an unlimited-length sequence of Java letters and Java digits,
# the first of which must be a Java letter. An identifier cannot have the same
# spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
# (§3.10.3), or the null literal (§3.10.7).
[...]
# A "Java letter" is a character for which the method
# Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
# is a character for which the method Character.isJavaIdentifierPart(int)
# returns true.
[...]
# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

For Java 1.5.0:

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

# Character information is based on the Unicode Standard, version 4.0.

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

# A character may start a Java identifier if and only if one of the following
# conditions is true:
#
# * isLetter(codePoint) returns true
# * getType(codePoint) returns LETTER_NUMBER
# * the referenced character is a currency symbol (such as "$")

[This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
General Category Sc.]

# * the referenced character is a connecting punctuation character (such as "_").

[This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
General Category Pc.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

# A character may be part of a Java identifier if any of the following are true:
#
# * it is a letter
# * it is a currency symbol (such as '$')
# * it is a connecting punctuation character (such as '_')
# * it is a digit
# * it is a numeric letter (such as a Roman numeral character)

[General Category Nl.]

# * it is a combining mark

[General Category Mc (see <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

# * it is a non-spacing mark

[General Category Mn (ditto).]

# * isIdentifierIgnorable(codePoint) returns true for the character

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

# A character is a digit if its general category type, provided by
# getType(codePoint), is DECIMAL_DIGIT_NUMBER.

[General Category Nd.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

# The following Unicode characters are ignorable in a Java identifier or a Unicode
# identifier:
#
# * ISO control characters that are not whitespace
#     o '\u0000' through '\u0008'
#     o '\u000E' through '\u001B'
#     o '\u007F' through '\u009F'
# * all characters that have the FORMAT general category value

[FORMAT is General Category Cf.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

# A character is considered to be a letter if its general category type, provided
# by getType(codePoint), is any of the following:
#
# * UPPERCASE_LETTER
# * LOWERCASE_LETTER
# * TITLECASE_LETTER
# * MODIFIER_LETTER
# * OTHER_LETTER

====

To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

Keyword ::= one of
    abstract    continue    for           new          switch
    assert      default     if            package      synchronized
    boolean     do          goto          private      this
    break       double      implements    protected    throw
    byte        else        import        public       throws
    case        enum        instanceof    return       transient
    catch       extends     int           short        try
    char        final       interface     static       void
    class       finally     long          strictfp     volatile
    const       float       native        super        while

Identifier        ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
IdentifierChars   ::= JavaLetter | IdentifierChars JavaLetterOrDigit
JavaLetter        ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
                      U+0000..0008 | U+000E..001B | U+007F..009F | Cf

where the two-letter terminals refer to General Categories in Unicode 4.0.0
(exactly).
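
The grammar above can be checked against a running JVM. The following is a minimal sketch, assuming only the java.lang.Character methods quoted earlier; the class name IdentifierCheck and the method isJavaIdentifier are invented for the example, and keywords and literals are deliberately not excluded.

```java
// Hypothetical helper for illustration; keyword and literal checks are omitted.
final class IdentifierCheck {

    /** True if the first code point is a "Java letter" and the rest are "Java letter-or-digit". */
    static boolean isJavaIdentifier(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // "str" followed by a CJK letter, a Greek beta, and a black star
        String[] samples = { "str\u5317", "str\u03B2", "str\u2605" };
        for (String s : samples) {
            int last = s.codePointBefore(s.length());
            System.out.printf("U+%04X general category %d, identifier ok: %b%n",
                    last, Character.getType(last), isJavaIdentifier(s));
        }
    }
}
```

The black star (U+2605) has General Category So, which appears in neither JavaLetter nor JavaLetterOrDigit, so the name str★ from the original post is rejected while str北 and strβ pass.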

Note that the so-called "ignorable" characters (for which
isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
treated like any other identifier character. This quote from the API spec:

# The following Unicode characters are ignorable in a Java identifier [...]

should be ignored (no pun intended). It is contradicted by:

# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

in the language spec. Unicode does have a concept of ignorable characters in
identifiers, which is probably where this documentation bug crept in.

The inclusion of U+0000 and various control characters in the set of valid
identifier characters is also a dubious decision, IMHO.
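
What the API reports for these characters can be inspected directly. This is a small sketch using only java.lang.Character; the class name IgnorableDemo and the method ignorableButAccepted are invented for the example. It shows what the library methods return and says nothing about how any particular compiler treats such characters.

```java
// Hypothetical demo class; it only queries java.lang.Character.
final class IgnorableDemo {

    /** True if the API calls the code point "ignorable" yet also accepts it inside an identifier. */
    static boolean ignorableButAccepted(int codePoint) {
        return Character.isIdentifierIgnorable(codePoint)
                && Character.isJavaIdentifierPart(codePoint);
    }

    public static void main(String[] args) {
        // NUL, SOFT HYPHEN (a format character), LATIN CAPITAL LETTER A
        int[] samples = { 0x0000, 0x00AD, 0x0041 };
        for (int cp : samples) {
            System.out.printf("U+%04X ignorable=%b part=%b start=%b%n",
                    cp,
                    Character.isIdentifierIgnorable(cp),
                    Character.isJavaIdentifierPart(cp),
                    Character.isJavaIdentifierStart(cp));
        }
        // As plain strings, a name with and without the soft hyphen differ.
        String plain = "ab";
        String withSoftHyphen = "a" + (char) 0x00AD + "b";
        System.out.println("same spelling: " + plain.equals(withSoftHyphen));
    }
}
```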

Note that I am not defending in any way the complexity of this definition; there's
clearly no excuse for it (or for the "ignorable" documentation bug). The language
spec should have been defined directly in terms of the Unicode General Categories,
and then the API in terms of the language spec. The way it is done now is
completely backwards.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Joachim Durchholz

Jul 1, 2006, 3:46:50 AM
Chris Uppal schrieb:

> Joachim Durchholz wrote:
>
>>> This is implementation-defined in C. A compiler is allowed to accept
>>> variable names with alphabetic Unicode characters outside of ASCII.
>> Hmm... that code would be nonportable, so C support for Unicode is
>> half-baked at best.
>
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.

I don't think this is a problem in practice. E.g. if a language uses the
usual definition for identifiers (first letter, then letters/digits),
you end up with a language that changes its definition on the whims of
the Unicode consortium, but that's less of a problem than one might
think at first.

I'd expect two kinds of changes in character categorization: additions
and corrections. (Any other?)

Additions are relatively unproblematic. Existing code will remain valid
and retain its semantics. The new characters will be available for new
programs.
There's a slight technological complication: the compiler needs to be
able to look up the newest definition. In other words, for a compiler to
run, it needs to be able to access http://unicode.org, or the language
infrastructure needs a way to carry around various revisions of the
Unicode tables and select the newest one.

Corrections are technically more problematic, but then we can rely on
the common sense of the programmers. If the Unicode consortium
miscategorized a character as a letter, the programmers that use that
character set will probably know it well enough to avoid its use. It
will probably not even occur to them that that character could be a
letter ;-)


Actually I'm not sure that Unicode is important for long-lived code.
Code tends to not survive very long unless it's written in English, in
which case anything outside of strings is in 7-bit ASCII. So the
majority of code won't ever be affected by Unicode problems - Unicode is
more a way of lowering entry barriers.

Regards,
Jo

Dr.Ruud

Jul 1, 2006, 6:51:27 AM
Chris Uppal schreef:

> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators,
> symbol, punctuation.... ?), there doesn't seem to be any sane way
> that a language could allow an unrestricted choice of Unicode in
> identifiers.

The Perl-code below prints:

xdigit
22 /194522 = 0.011% (lower: 6, upper: 6)
ascii
128 /194522 = 0.066% (lower: 26, upper: 26)
\d
268 /194522 = 0.138%
digit
268 /194522 = 0.138%
IsNumber
612 /194522 = 0.315%
alpha
91183 /194522 = 46.875% (lower: 1380, upper: 1160)
alnum
91451 /194522 = 47.013% (lower: 1380, upper: 1160)
word
91801 /194522 = 47.193% (lower: 1380, upper: 1160)
graph
102330 /194522 = 52.606% (lower: 1380, upper: 1160)
print
102349 /194522 = 52.616% (lower: 1380, upper: 1160)
blank
18 /194522 = 0.009%
space
24 /194522 = 0.012%
punct
374 /194522 = 0.192%
cntrl
6473 /194522 = 3.328%


Especially look at 'word', the same as \w, which for ASCII is
[0-9A-Za-z_].


==8<===================
#!/usr/bin/perl
# Program-Id: unicount.pl
# Subject: show Unicode statistics

use strict ;
use warnings ;

use Data::Alias ;

binmode STDOUT, ':utf8' ;

my @table =
  # +--Name------+---qRegexp--------+-C-+-L-+-U-+
  (
    [ 'xdigit'   , qr/[[:xdigit:]]/ , 0 , 0 , 0 ] ,
    [ 'ascii'    , qr/[[:ascii:]]/  , 0 , 0 , 0 ] ,
    [ '\\d'      , qr/\d/           , 0 , 0 , 0 ] ,
    [ 'digit'    , qr/[[:digit:]]/  , 0 , 0 , 0 ] ,
    [ 'IsNumber' , qr/\p{IsNumber}/ , 0 , 0 , 0 ] ,
    [ 'alpha'    , qr/[[:alpha:]]/  , 0 , 0 , 0 ] ,
    [ 'alnum'    , qr/[[:alnum:]]/  , 0 , 0 , 0 ] ,
    [ 'word'     , qr/[[:word:]]/   , 0 , 0 , 0 ] ,
    [ 'graph'    , qr/[[:graph:]]/  , 0 , 0 , 0 ] ,
    [ 'print'    , qr/[[:print:]]/  , 0 , 0 , 0 ] ,
    [ 'blank'    , qr/[[:blank:]]/  , 0 , 0 , 0 ] ,
    [ 'space'    , qr/[[:space:]]/  , 0 , 0 , 0 ] ,
    [ 'punct'    , qr/[[:punct:]]/  , 0 , 0 , 0 ] ,
    [ 'cntrl'    , qr/[[:cntrl:]]/  , 0 , 0 , 0 ] ,
  ) ;

my @codepoints =
  (
    0x0000  .. 0xD7FF,
    0xE000  .. 0xFDCF,
    0xFDF0  .. 0xFFFD,
    0x10000 .. 0x1FFFD,
    0x20000 .. 0x2FFFD,
  # 0x30000 .. 0x3FFFD, # etc.
  ) ;

for my $row ( @table )
{
    alias my ($name, $qrx, $count, $lower, $upper) = @$row ;

    printf "\n%s\n", $name ;

    my $n = 0 ;

    for ( @codepoints )
    {
        local $_ = chr ; # int-2-char conversion
        $n++ ;

        if ( /$qrx/ )
        {
            $count++ ;
            $lower++ if / [[:lower:]] /x ;
            $upper++ if / [[:upper:]] /x ;
        }
    }

    my $show_lower_upper =
        ($lower || $upper)
        ? sprintf( " (lower:%6d, upper:%6d)"
                 , $lower
                 , $upper
                 )
        : '' ;

    printf "%6d /%6d =%7.3f%%%s\n"
         , $count
         , $n
         , 100 * $count / $n
         , $show_lower_upper
}
__END__
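
For comparison with the Java rules discussed earlier in the thread, a similar census can be sketched in Java. This is only an illustration: the class name IdentifierCensus and the method census are invented, the loop walks every code point rather than the ranges used in the Perl script, and the totals depend on the Unicode version shipped with the JDK, so no figures are quoted here.

```java
// Hypothetical census of identifier characters; results vary with the JDK's Unicode tables.
final class IdentifierCensus {

    /** Returns { defined code points, Java identifier starts, Java identifier parts }. */
    static int[] census() {
        int defined = 0;
        int start = 0;
        int part = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.isDefined(cp)) {
                defined++;   // general category is anything other than UNASSIGNED
            }
            if (Character.isJavaIdentifierStart(cp)) {
                start++;
            }
            if (Character.isJavaIdentifierPart(cp)) {
                part++;
            }
        }
        return new int[] { defined, start, part };
    }

    public static void main(String[] args) {
        int[] c = census();
        System.out.printf("defined: %d%n", c[0]);
        System.out.printf("identifier start: %d (%.3f%% of defined)%n", c[1], 100.0 * c[1] / c[0]);
        System.out.printf("identifier part:  %d (%.3f%% of defined)%n", c[2], 100.0 * c[2] / c[0]);
    }
}
```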

--
Affijn, Ruud

"Gewoon is een tijger."


David Hopwood

Jul 1, 2006, 9:20:52 AM
Joachim Durchholz wrote:
> Chris Uppal schrieb:
>> Joachim Durchholz wrote:
>>
>>>> This is implementation-defined in C. A compiler is allowed to accept
>>>> variable names with alphabetic Unicode characters outside of ASCII.
>>>
>>> Hmm... that code would be nonportable, so C support for Unicode is
>>> half-baked at best.
>>
>> Since the interpretation of characters which are yet to be added to
>> Unicode is undefined (will they be digits, "letters", operators, symbol,
>> punctuation.... ?), there doesn't seem to be any sane way that a
>> language could allow an unrestricted choice of Unicode in identifiers.
>
> I don't think this is a problem in practice. E.g. if a language uses the
> usual definition for identifiers (first letter, then letters/digits),
> you end up with a language that changes its definition on the whims of
> the Unicode consortium, but that's less of a problem than one might
> think at first.

It is not a problem at all. See the stability policies in
<http://www.unicode.org/reports/tr31/tr31-2.html>.

> Actually I'm not sure that Unicode is important for long-lived code.
> Code tends to not survive very long unless it's written in English, in
> which case anything outside of strings is in 7-bit ASCII. So the
> majority of code won't ever be affected by Unicode problems - Unicode is
> more a way of lowering entry barriers.

Unicode in identifiers has certainly been less important than some thought
it would be -- and not at all important for open source projects, for example,
which essentially have to use English to get the widest possible participation.

--
David Hopwood <david.nosp...@blueyonder.co.uk>

Oliver Bandel

Jul 2, 2006, 12:20:19 PM
Matthias Blume wrote:

OK, back to ISO-8859-1 :) no one needs so many symbols,
this is enough: äöüÄÖÜß :)


Ciao,
Oliver

Matthias Blume

Jul 2, 2006, 6:28:50 PM
Oliver Bandel <oli...@first.in-berlin.de> writes:

>>>Oliver Bandel wrote:
>>>
>>>>こんいちわ Xah-Lee san ;-)
>>>
>>>Uhm, I'd guess that Xah is Chinese. Be careful
>>>with such things in real life; Koreans might
>>>beat you up for this. Stay alive!
>> And the Japanese might beat him up, too. For butchering their
>> language. :-)
>
> OK, back to ISO-8859-1 :) no one needs so many symbols,
> this is enough: äöüÄÖÜß :)

There are plenty of people who need such symbols (more people than
those who need ß, btw).

Matthias

PS: It should have been こんにちは.

Joachim Durchholz

Jul 4, 2006, 3:22:01 AM
Oliver Bandel schrieb:

If you want äöüÄÖÜß, anybody else will want their local characters, too,
and nothing below full Unicode will work.

Just for laughs, here's a list of non-ASCII Latin-based letters in
Unicode (not verified for completeness):
ÀÁÂÃÄÅÆàáâãäåæĀāĂăĄąǺǻǼǽ
ÇçĆćĈĉĊċČč
ĎďĐđ
ÈÉÊËèéêëĒēĔĕĖėĘęĚě
ĜĝĞğĠġĢģ
ĤĥĦħ
ÌÍÎÏìíîïĨĩĪīĬĭĮįİıIJij
Ĵĵ
Ķķĸ
ĹĺĻļĽĿŀŁł
Ðð
ÑñŃńŅņŇňʼnŊŋ
ÒÓÔÕØòóôöõŌōŎŏÖŐőŒœǾǿ
ŔŕŖŗŘř
ŚśŜŝŞşŠšß
ŢţŤťŦŧ
ÜÙÚÛüùúûŨũŪūŭŮůŰűŲų
Ŵŵ
ÝýÿŶŷŸ
Þþ
ŹźŻżŽž
ƒſ
ISO 8859-1 covers just a fraction of these, so Unicode would indeed be
necessary to allow a program written in one country to compile in
another one.
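
The coverage claim is easy to test mechanically. Below is a minimal sketch, assuming Java 7 or later for java.nio.charset.StandardCharsets; the class name Latin1Coverage and the method notInLatin1 are invented for the example.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: filters a string down to the characters ISO 8859-1 cannot represent.
final class Latin1Coverage {

    /** Returns the characters of the input that have no ISO 8859-1 encoding, in order. */
    static String notInLatin1(String letters) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder missing = new StringBuilder();
        for (int i = 0; i < letters.length(); ) {
            int cp = letters.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (!latin1.canEncode(ch)) {
                missing.append(ch);
            }
            i += Character.charCount(cp);
        }
        return missing.toString();
    }

    public static void main(String[] args) {
        // A-diaeresis, sharp s, A-macron, L-stroke, O-double-acute
        String sample = "\u00C4\u00DF\u0100\u0141\u0150";
        String missing = notInLatin1(sample);
        System.out.println(missing.length() + " of " + sample.length()
                + " sample letters are outside ISO 8859-1");
    }
}
```

Feeding the whole list above through notInLatin1 would show which of those letters fall outside ISO 8859-1.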

Regards,
Jo

Pascal Bourguignon

Jul 4, 2006, 6:19:45 AM
Joachim Durchholz <j...@durchholz.org> writes:

Indeed, far from complete:

(coerce (lschar :name "LATIN") 'string)
--> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóô
õöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩ
ĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝ
ŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑ
ƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDž
džLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹ
ǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȢȣȤȥȦȧȨȩȪȫȬȭȮ
ȯȰȱȲȳɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥɦɧɨɩɪɫɬɭɮɯɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾ
ɿʀʁʂʃʄʅʆʇʈʉʊʋʌʍʎʏʐʑʒʓʔʕʖʗʘʙʚʛʜʝʞʟʠʡʢʣʤʥʦʧʨʩʪʫʬʭ
ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫ
ḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟ
ṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓ
ẔẕẖẗẘẙẚẛẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊị
ỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹⁱⁿ⒜⒝⒞⒟
⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏⓐⓑⓒⓓ
ⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ✝✞✟fffiflffifflſtstABCDEFGHIJ
KLMNOPQRSTUVWXYZabcdefghij
klmnopqrstuvwxyz"


--
__Pascal Bourguignon__ http://www.informatimago.com/

READ THIS BEFORE OPENING PACKAGE: According to certain suggested
versions of the Grand Unified Theory, the primary particles
constituting this product may decay to nothingness within the next
four hundred million years.

Mumia W.

Jul 4, 2006, 9:49:18 AM
Pascal Bourguignon wrote:
> [...]

> (coerce (lschar :name "LATIN") 'string)
> --> "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
> АБВГДЕЖЗИЙКЛМНОПРСТУФХЦШЩЪЫЬЭЮЯабвгдежзийклмнопрстуф
> [...]

In what programming language/interpreter is this code?

David Squire

Jul 4, 2006, 9:55:32 AM

Looks pretty lispy to me... and comp.lang.lisp was in the original
newsgroups list.


DS

Pascal Bourguignon

Jul 4, 2006, 12:05:46 PM
"Mumia W." <mumia.w.18.spa...@earthlink.net> writes:

Common Lisp; in particular, clisp:

(defun string-match-p (pattern string)
  "Matches a string."
  #+(and clisp regexp) (regexp:match pattern string :ignore-case t)
  #-(and clisp regexp) (search pattern string
                               :test (function equalp)))

(defun lschar (&key (start 0) (end #x11000) name)
  "Prints all the characters of codes between start and end, with their names."
  (if name
      (loop :for code :from start :below end
            :when (string-match-p name (char-name (code-char code)))
              :collect #1=(progn (format t "#x~5,'0X ~:*~6D ~C ~S~%"
                                         code (code-char code)
                                         (char-name (code-char code)))
                                 (code-char code)))
      (loop :for code :from start :below end :collect #1#)))

--
__Pascal Bourguignon__ http://www.informatimago.com/

This is a signature virus. Add me to your signature and help me to live.

Dale King

Jul 5, 2006, 2:00:07 PM
Tim Roberts wrote:
> "Xah Lee" <x...@xahlee.org> wrote:
>
>> Languages with Full Unicode Support
>>
>> As far as i know, Java and JavaScript are languages with full, complete
>> unicode support. That is, they allow names to be defined using unicode.
>> (the JavaScript engine used by FireFox support this)
>>
>> As far as i know, here's few other lang's status:
>>
>> C → No.
>
> This is implementation-defined in C. A compiler is allowed to accept
> variable names with alphabetic Unicode characters outside of ASCII.

I don't think it is implementation defined. I believe it is actually
required by the spec. The trouble is that so few compilers actually
comply with the spec. A few years ago I asked for someone to actually
point to a fully compliant compiler and no one could.

--
Dale King

Tim Roberts

Jul 5, 2006, 11:51:05 PM
Dale King <Dale...@gmail.com> wrote:
>Tim Roberts wrote:
>> "Xah Lee" <x...@xahlee.org> wrote:
>>
>>> Languages with Full Unicode Support
>>>
>>> As far as i know, Java and JavaScript are languages with full, complete
>>> unicode support. That is, they allow names to be defined using unicode.
>>> (the JavaScript engine used by FireFox support this)
>>
>> This is implementation-defined in C. A compiler is allowed to accept
>> variable names with alphabetic Unicode characters outside of ASCII.
>
>I don't think it is implementation defined. I believe it is actually
>required by the spec.

C99 does have a list of Unicode codepoints that are required to be accepted
in identifiers, although implementations are free to accept other
characters as well. For example, few people realize that Visual C++
accepts the dollar sign $ in an identifier.
