characters in CL

Mark Tarver

Mar 18, 2011, 4:19:32 AM

This came up in qilang. I thought the fastest way to get some light
on this would be to throw it up for general discussion here in Lisp.
Is there a weakness in CL here?

Note the initial part of the post quoted is talking about
Javascript/Python.

QUOTE
"\u661F" (星) is a string.

The same type as "\u65E5\u751F" (日生).

'\u661F' is the same as "\u661F" except for ' vs " quoting.

Javascript and python haven't quite escaped the conceptual trap of
characters, since "\u661f" is a string of length 1, and "\u6535\u751f"
is a string of length 2, but by having their 'characters' of the same
class as their strings the problem is reduced.

You can think of them as having strings composed of substrings, with a
fixed composition mechanism.

Ideally the composition mechanism would be variable, but even this
compromise is a big step up from characters vs. strings in the CL
sense.

Any CL system dealing with characters as unicode code-points will
break when exposed to anything requiring combining character sequences
-- e.g., Tibetan, Thai, lots of Indian scripts, etc.

With javascript, python, etc, in most cases you should at least be
able to pass through sequences of characters where the programmer
anticipated a single character without problems.

Even doing case conversion in German breaks in CL, since string-upcase
can't map "ss" to ß, since it operates on a character by character
basis.
UNQUOTE
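
The length and type claims in the quoted post can be checked in Python 3. This is a minimal sketch rather than anything from the qilang post itself; the variable names and the `describe` helper are made up for illustration.

```python
import unicodedata

star = "\u661f"        # 星
pair = "\u65e5\u751f"  # 日生

# Indexing a str yields another str: there is no separate character type.
first = star[0]
print(type(first) is type(pair))  # True
print(len(star), len(pair))       # 1 2

# A combining character sequence: 'q' followed by COMBINING ACUTE ACCENT.
# It reads as one character but is two code points, and NFC normalization
# has no precomposed form to fold it into.
q_acute = "q\u0301"
print(len(q_acute))                                # 2
print(len(unicodedata.normalize("NFC", q_acute)))  # 2


def describe(ch):
    """Hypothetical helper written with 'one character' in mind."""
    return f"<{ch}> ({len(ch)} code point(s))"


# The helper still accepts the whole sequence, because a one-element
# string and a longer string are the same type.
print(describe(q_acute))
```

The last call is the "pass through" case the quoted post describes: code written for a single character receives a two-code-point string and nothing breaks at the type level.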

Opinions?

Mark

Espen Vestre

Mar 18, 2011, 4:42:18 AM

Mark Tarver <dr.mt...@ukonline.co.uk> writes:

> Even doing case conversion in German breaks in CL, since string-upcase
> can't map "ss" to ß, since it operates on a character by character
> basis.

I guess you mean map ß to SS? I don't know if that's really a big
problem, since the opposite conversion (SS to ss or ß) is so hard
(impossible to implement without a full knowledge of the ss vs ß grammar
rules)....
--
(espen)

Paul Rubin

Mar 18, 2011, 5:03:03 AM

Espen Vestre <es...@vestre.net> writes:
>> Even doing case conversion in German breaks in CL, since string-upcase
>> can't map "ss" to ß, since it operates on a character by character
>> basis.
>
> I guess you mean map ß to SS? I don't know if that's really a big
> problem, since the opposite conversion (SS to ss or ß) is so hard
> (impossible to implement without a full knowledge of the ss vs ß grammar
> rules)....

According to my unicode chart popup, ß (U+00DF) upcases to ẞ(U+1E9E).
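
For comparison, here is what Python 3's built-in case conversion does with these two code points. It is only a sketch of one implementation's behaviour, not a statement about what any CL implementation returns.

```python
import unicodedata

sharp_s = "\u00df"          # ß
capital_sharp_s = "\u1e9e"  # ẞ

print(unicodedata.name(capital_sharp_s))  # LATIN CAPITAL LETTER SHARP S

# Upcasing ß yields two characters, so the string gets longer.
print(sharp_s.upper())          # SS

# The capital sharp s lowercases to ß, but ß does not upcase to it.
print(capital_sharp_s.lower())  # ß

# The round trip is lossy: nothing records that "SS" came from ß.
upcased = "straße".upper()
round_trip = upcased.lower()
print(upcased, round_trip)      # STRASSE strasse
```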

Pascal J. Bourguignon

Mar 18, 2011, 5:34:15 AM

Mark Tarver <dr.mt...@ukonline.co.uk> writes:

> This came up in qilang. I thought the fastest way to get some light
> on this would be to throw it up for general discussion here in Lisp.
> Is there a weakness in CL here?

I'm not convinced that characters are a bad thing. Unicode code-points,
perhaps (well perhaps not a bad thing, but a low level technicality),
but I don't see a problem in distinguishing characters from strings.

In any case, if you prefer substrings, nothing prevents you from using
SUBSEQ instead of CHAR or AREF.

> Any CL system dealing with characters as unicode code-points will
> break when exposed to anything requiring combining character sequences
> -- e.g., Tibetan, Thai, lots of Indian scripts, etc.

Indeed. IMO, correct treatment of unicode in CL requires that CL
characters be defined as combining character sequences (probably
normalized, like MacOSX likes to do it). Notice that the CL standard
explicitly says that (not (eq "é" "é")) is possible, there's a reason,
and now we know it: unicode. It also means that CL characters, when
supporting unicode, must be more complex a structure than mere fixnums.
Deal with it!
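
To make "characters as combining character sequences" concrete, here is a rough Python sketch that groups each base code point with the marks that follow it. The function name is hypothetical, and the rule used (attach anything whose general category starts with "M") is a simplification; full grapheme cluster segmentation as defined by Unicode has more cases than this.

```python
import unicodedata


def combining_sequences(text):
    """Split text into base code points, each followed by its marks.

    Simplified: a code point is attached to the previous element when its
    general category is a mark (Mn, Mc or Me). The text is NFC-normalized
    first, in the spirit of the normalization mentioned above.
    """
    sequences = []
    for ch in unicodedata.normalize("NFC", text):
        if sequences and unicodedata.category(ch).startswith("M"):
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences


# 'e' + COMBINING ACUTE ACCENT composes to a single code point under NFC,
# while 'q' + COMBINING ACUTE ACCENT has no precomposed form.
print(combining_sequences("e\u0301q\u0301x"))

# Thai: KO KAI + SARA I (a nonspacing mark) + NO NU.
print(combining_sequences("\u0e01\u0e34\u0e19"))
```

Each element of the result is a short string rather than a fixnum, which is the "more complex structure" the paragraph above has in mind.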

> With javascript, python, etc, in most cases you should at least be
> able to pass through sequences of characters where the programmer
> anticipated a single character without problems.
>
> Even doing case conversion in German breaks in CL, since string-upcase
> can't map "ss" to ß, since it operates on a character by character
> basis.

STRING-UPCASE wouldn't have a problem with "ss" -> "ß", but
NSTRING-UPCASE would.

However, STRING-UPCASE is not defined to perform a localized text
upcasing, but in terms of CHAR-UPCASE, which is defined in terms of
LOWER-CASE-P and UPPER-CASE-P (more precisely both CHAR-UPCASE and
LOWER-CASE-P are defined in terms of the same glossary entry).

In any case, I repeat, the standard CL functions are not defined in
terms of localized text.
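
The difference between a character-by-character upcase and a whole-string one can be sketched in Python. The two helper names below are invented for the illustration; `char_upcase` only imitates the one-character-in, one-character-out contract and is not a port of CL's CHAR-UPCASE.

```python
def char_upcase(ch):
    """Upcase one character, always returning exactly one character.

    When the full uppercase mapping is longer than one character
    (as for ß), the input is returned unchanged.
    """
    up = ch.upper()
    return up if len(up) == 1 else ch


def string_upcase_charwise(text):
    """Upcase by mapping each character independently."""
    return "".join(char_upcase(ch) for ch in text)


word = "straße"
charwise = string_upcase_charwise(word)
whole = word.upper()

print(charwise, len(charwise))  # STRAßE 6
print(whole, len(whole))        # STRASSE 7
```

The character-wise result always has the same length as its input, which is what an in-place operation such as NSTRING-UPCASE needs; the whole-string result can grow, so it has to return a new string.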

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.

Tim Bradshaw

Mar 18, 2011, 5:55:07 AM

On 2011-03-18 08:19:32 +0000, Mark Tarver said:

>
> You can think of them as having strings composed of substrings, with
> a
> fixed composition mechanism.

I think that this kind of thing is basically a disaster for a language
which wants to have a coherent type system. You really, I think, have
two options:

strings are a special magic type orthogonal to any other type (in
particular they are not arrays or sequences);

strings are some kind of sequence/array type.

For a language which is not entirely about string bashing the latter
option is the obvious one. Then you have to ask the question: what
type are strings sequences of? The answer can't be "strings" because
then the type system falls to bits in some horrible way: you need
another type (or the option of another type: strings could be sequences
whose elements are (strings OR <something>).) That other type, in CL,
is characters.

The other option, having strings *not* be sequences but some special
magic type is possible, and I suspect it's basically what Perl does,
for instance (in so far as Perl has a coherent type system at all). I
would not want CL to do this.

I don't know how this maps on to unicode: I do know that unicode has
lots of cases where complicated things happen, and I also know that I
don't understand it. Unfortunately the only person I knew who I was
sure really *did* understand it is dead, so we can't get his opinion.
I don't think that the sharp-s / ss thing in German has much to do with
this, because there are lots of complicated rules around that (which I
can no longer remember). It may be that assuming things like
string-upcase / char-upcase &c are simple is just a huge mistake.

--tim

Espen Vestre

Mar 18, 2011, 6:34:32 AM

Paul Rubin <no.e...@nospam.invalid> writes:

> According to my unicode chart popup, ß (U+00DF) upcases to ẞ(U+1E9E).

Yes, but AFAIK this is not yet part of the official german grammar
rules, which prescribe ß -> SS.
--
(espen)

Joost Kremers

Mar 18, 2011, 6:34:43 AM

Paul Rubin wrote:
> According to my unicode chart popup, ß (U+00DF) upcases to ẞ(U+1E9E).

but currently, german spelling rules state that ß upcases to SS; officially,
there is no uppercase ß in german.

--
Joost Kremers joostk...@yahoo.com
Selbst in die Unterwelt dringt durch Spalten Licht
EN:SiS(9)

Paul Rubin

Mar 18, 2011, 6:57:38 AM

Joost Kremers <joostk...@yahoo.com> writes:
> but currently, german spelling rules state that ß upcases to SS; officially,
> there is no uppercase ß in german.

Hmm, ok. I thought I had seen ß in upcased words but I haven't been
exposed to that much German. There's a Wikipedia article:

http://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F
http://en.wikipedia.org/wiki/Capital_%C3%9F

Tamas K Papp

Mar 18, 2011, 7:46:09 AM

I agree. I don't think it is reasonable to expect the standard
library (e.g. of CL) to deal with the idiosyncratic rules of natural
languages, especially considering that there are many languages and
their rules sometimes change (German changed recently).

CL implementations can surely _represent_ text, especially with
Unicode, and the library functions can handle simple operations for
languages that follow the patterns of English. I guess it is up to
libraries to handle more complex stuff.

Tamas

D Herring

Mar 18, 2011, 9:06:16 AM

On 03/18/2011 04:19 AM, Mark Tarver wrote:
> This came up in qilang. I thought the fastest way to get some light
> on this would be to throw it up for general discussion here in Lisp.
> Is there a weakness in CL here?

...
> Opinions?

Relevant reading:
http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster
http://www.w3.org/TR/charmod/

General consensus is that utf-32's only advantage is that it is
slightly easier to decode than other variants. It is not useful for
string indexing, merging, substitution, etc. Unless you only care
about ASCII.
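
A small Python sketch of why fixed-width code units do not by themselves make indexing meaningful. The helper name is hypothetical; it slices raw UTF-32 bytes to mimic constant-time access to the i-th code point.

```python
def utf32_unit(text, i):
    """Return the i-th code point by slicing a UTF-32 (little-endian) buffer."""
    raw = text.encode("utf-32-le")
    return raw[4 * i:4 * i + 4].decode("utf-32-le")


# For ASCII the i-th code unit is the i-th character a reader sees.
print(utf32_unit("abc", 1))  # b

# For a combining sequence, the constant-time lookup returns only the base
# letter and silently drops the accent that belongs to it.
q_acute = "q\u0301"
print(utf32_unit(q_acute, 0) == "q")  # True
print(len(q_acute.encode("utf-32-le")) // 4)  # 2 code units, 1 visible character
```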

- Daniel

Joost Kremers

Mar 18, 2011, 9:30:43 AM

Paul Rubin wrote:
> Joost Kremers <joostk...@yahoo.com> writes:
>> but currently, german spelling rules state that ß upcases to SS; officially,
>> there is no uppercase ß in german.
>
> Hmm, ok. I thought I had seen ß in upcased words

yes, it does appear sometimes, it's just not officially sanctioned by the Rat
für Deutsche Rechtschreibung (Council for German Orthography). (though the
german wikipedia article on ß <http://de.wikipedia.org/wiki/%C3%9F> suggests
this may change in the future.)

D Herring

Mar 19, 2011, 1:28:02 AM

On 03/18/2011 09:06 AM, D Herring wrote:
> On 03/18/2011 04:19 AM, Mark Tarver wrote:
>> This came up in qilang. I thought the fastest way to get some light
>> on this would be to throw it up for general discussion here in Lisp.
>> Is there a weakness in CL here?
> ...
>> Opinions?
>
> Relevant reading:
> http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
> http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
> http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
> http://www.tbray.org/ongoing/When/200x/2003/05/17/Yooster
> http://www.w3.org/TR/charmod/

A bit more I didn't have time to post earlier.

I think the thing that distinguishes string datatypes from vectors of
numbers is the explicit expectation that strings be human readable.
This extends to the expectation that the language can manipulate
strings in a "natural" way (e.g. things like word splitting,
capitalization, and text rendering).

A "correct" solution results in a heavy library like ICU.
http://site.icu-project.org/

As Tim Bray notes, much code just needs a fast solution that treats
strings as an opaque bag-o-bits to be copied and handed to
other code for rendering. So the correct solution isn't always the
best one.

January was an interesting month for string discussion on the Boost
devel list. Here are the main threads I remember. While there are
the usual flames and such, there are some good posts in here.

http://thread.gmane.org/gmane.comp.lib.boost.devel/214601/
http://thread.gmane.org/gmane.comp.lib.boost.devel/213026/
http://thread.gmane.org/gmane.comp.lib.boost.devel/213235/
http://thread.gmane.org/gmane.comp.lib.boost.devel/213726/
http://thread.gmane.org/gmane.comp.lib.boost.devel/214701/

One last thing I wanted to add: as specified, unicode (especially
utf8) is somewhat like a Huffman code optimized for ASCII. Unicode
needs a universal external format; but a given text rarely contains a
random mix of characters. Thus I suspect that many programs would
benefit from using a locale-specific internal format.
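
The "optimized for ASCII" point can be seen by counting UTF-8 bytes per code point. This is an illustrative Python sketch; the sample characters and the Cyrillic word are arbitrary choices, and cp1251 stands in for one possible locale-specific format.

```python
def utf8_width(ch):
    """Number of bytes UTF-8 uses for a single code point."""
    return len(ch.encode("utf-8"))


# a, é, я, 星, and an emoji outside the Basic Multilingual Plane.
samples = ["a", "\u00e9", "\u044f", "\u661f", "\U0001f600"]
widths = [utf8_width(ch) for ch in samples]
print(widths)  # [1, 2, 2, 3, 4]

# A purely Cyrillic word costs two bytes per letter in UTF-8 but one byte
# per letter in a single-byte Cyrillic code page.
word = "\u043f\u0440\u0438\u0432\u0435\u0442"  # привет
print(len(word.encode("utf-8")), len(word.encode("cp1251")))  # 12 6
```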

I think that's all I'll say unless people have specific questions.

- Daniel

Aidan Kehoe

Mar 19, 2011, 10:37:20 AM

On the nineteenth of March, D. Herring wrote:

> One last thing I wanted to add: as specified, unicode (especially utf8) is
> somewhat like a Huffman code optimized for ASCII. Unicode needs a universal
> external format; but a given text rarely contains a random mix of
> characters. Thus I suspect that many programs would benefit from using a
> locale-specific internal format.

We’re stuck with UTF-8. The closest thing to a locale-specific internal
format would be something like Windows-1251 or Shift-JIS, and using them
leads to data loss because not everyone using Windows-1251 will stick to
that subset of Cyrillic plus ASCII, and not everyone using Shift-JIS will
stick to ASCII and the relevant Kanji and kana.
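
The data-loss scenario is easy to reproduce in Python with the `cp1251` codec. A minimal sketch; the `fits` helper and the sample strings are made up for the example.

```python
def fits(text, codec):
    """True when every character of text can be encoded with codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


cyrillic_only = "\u043f\u0440\u0438\u0432\u0435\u0442, world"  # привет, world
with_kanji = "\u043f\u0440\u0438\u0432\u0435\u0442, \u661f"    # привет, 星

print(fits(cyrillic_only, "cp1251"))  # True
print(fits(with_kanji, "cp1251"))     # False

# Forcing the conversion replaces the unencodable character, and the
# original text cannot be recovered afterwards.
lossy = with_kanji.encode("cp1251", errors="replace").decode("cp1251")
print(lossy == with_kanji)  # False
```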

> I think that's all I'll say unless people have specific questions.
>
> - Daniel

--
“Apart from the nine-banded armadillo, man is the only natural host of
Mycobacterium leprae, although it can be grown in the footpads of mice.”
-- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research

Jeremiah Stoddard

Mar 19, 2011, 10:54:35 AM

On 03/19/2011 07:37 AM, Aidan Kehoe wrote:
>
> On the nineteenth of March, D. Herring wrote:
>
> > One last thing I wanted to add: as specified, unicode (especially utf8) is
> > somewhat like a Huffman code optimized for ASCII. Unicode needs a universal
> > external format; but a given text rarely contains a random mix of
> > characters. Thus I suspect that many programs would benefit from using a
> > locale-specific internal format.
>
> We’re stuck with UTF-8. The closest thing to a locale-specific internal
> format would be something like Windows-1251 or Shift-JIS, and using them
> leads to data loss because not everyone using Windows-1251 will stick to
> that subset of Cyrillic plus ASCII, and not everyone using Shift-JIS will
> stick to ASCII and the relevant Kanji and kana.

No need to use windows-1251 or anything along those lines. One can map
arbitrarily to an array of 8-bit elements, or 16-bit for a language
that needs a lot of characters. You get the performance benefit of
fixed-length characters if you're doing a lot of work that involves
finding a character in a specific location, and then you convert back
to UTF-8 when you save to a file, transmit over the network, or
whatever you're going to do. No data loss at all.
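
One way to read this proposal, sketched in Python: build a table of the distinct characters a text actually uses, store the text as small fixed-width indices into that table, and convert back on output. The function names and the choice of `array` type codes are assumptions of this sketch, not something specified in the post.

```python
from array import array


def to_compact(text):
    """Return (alphabet, codes): codes[i] indexes the i-th character."""
    alphabet = sorted(set(text))
    if len(alphabet) <= 0x100:
        typecode = "B"   # unsigned 8-bit elements
    elif len(alphabet) <= 0x10000:
        typecode = "H"   # unsigned 16-bit elements
    else:
        raise ValueError("too many distinct characters for 16-bit codes")
    index = {ch: i for i, ch in enumerate(alphabet)}
    return alphabet, array(typecode, (index[ch] for ch in text))


def from_compact(alphabet, codes):
    """Rebuild the original string from the table and the index array."""
    return "".join(alphabet[code] for code in codes)


text = "\u043f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"  # привет, мир
alphabet, codes = to_compact(text)

print(codes.itemsize)      # 1 byte per character
print(alphabet[codes[3]])  # constant-time lookup of the 4th character

# Converting back and out to UTF-8 loses nothing.
saved = from_compact(alphabet, codes).encode("utf-8")
print(saved.decode("utf-8") == text)  # True
```

Because the table is derived from the text itself rather than fixed in advance like Windows-1251, any character can be represented; the cost is carrying the table alongside the data.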

Jeremiah Stoddard

Anticomuna

Mar 19, 2011, 11:14:13 AM

On Mar 18, 5:19 am, Mark Tarver <dr.mtar...@ukonline.co.uk> wrote:
> Any CL system dealing with characters as unicode code-points will
> break when exposed to anything requiring combining character
> sequences
> -- e.g., Tibetan, Thai, lots of Indian scripts, etc.
>
> With javascript, python, etc, in most cases you should at least be
> able to pass through sequences of characters where the programmer
> anticipated a single character without problems.
>
> Even doing case conversion in German breaks in CL, since string-upcase
> can't map "ss" to ß, since it operates on a character by character
> basis.

Only if you use some function that specifically operates on a
character per character basis. I think nstring-upcase or char-upcase
would fall into this category.

But for others, such as string-upcase, it is just a matter of having a
locale sensitive operation that will correctly replace your ss with ß.

No "breaking", it is just a bug introduced by the developer.

Pascal J. Bourguignon

Mar 19, 2011, 11:35:47 AM

Anticomuna <ts.con...@uol.com.br> writes:

> But for others, such as string-upcase, it is just a matter of having a
> locale sensitive operation that will correctly replace your ss with ß.

Yes, something that is not string-upcase, since string-upcase is
specified to work character by character.

Tim Bradshaw

Mar 19, 2011, 12:08:13 PM

On 2011-03-19 14:37:20 +0000, Aidan Kehoe said:

>> Thus I suspect that many programs would benefit from using a
>> locale-specific internal format.
>
> We’re stuck with UTF-8.

I'm very confused by this. Do you intend "internal format" to mean
"the format of strings in memory" or "a private file format for an
application", or something else? Do there exist CL implementations
which use UTF-8 for strings?

Anticomuna

Mar 19, 2011, 1:43:12 PM

On Mar 19, 12:35 pm, "Pascal J. Bourguignon" <p...@informatimago.com>
wrote:

> Anticomuna <ts.concei...@uol.com.br> writes:
> > But for others, such as string-upcase, it is just a matter of having a
> > locale sensitive operation that will correctly replace your ss with ß.
>
> Yes, something that is not string-upcase, since string-upcase is
> specified to work character by character.
>
> --
> __Pascal Bourguignon__ http://www.informatimago.com/
> A bad day in () is better than a good day in {}.

Well, CL was created in a time where internationalization wasn't much
of a concern and that's why it just ignores it. It is possible for an
implementor to change the string-upcase function to handle locales
appropriately.

The nstring-upcase wouldn't allow it because it just alters an
existing string. Char-upcase wouldn't allow it because it returns just
one character at a time.

But my point was that what the OP said has nothing to do with Unicode.
All that is needed is a proper library.

Aidan Kehoe

Mar 19, 2011, 3:52:46 PM

On the nineteenth of March, Tim Bradshaw wrote:

> On 2011-03-19 14:37:20 +0000, Aidan Kehoe said:
>
> > On the nineteenth of March, D. Herring wrote:
>
> >> One last thing I wanted to add: as specified, unicode (especially
> >> utf8) is somewhat like a Huffman code optimized for ASCII. Unicode
> >> needs a universal external format; but a given text rarely contains a
> >> random mix of characters. Thus I suspect that many programs would
> >> benefit from using a locale-specific internal format.
>
> > We’re stuck with UTF-8.
>
> I'm very confused by this. Do you intend "internal format" to mean "the
> format of strings in memory" or "a private file format for an application",
> or something else?

I didn’t write “internal format”, but I understood it as *mostly* in the
first sense.

I mean, we *could* develop Unicode transformation formats that are optimised
for particular scripts, and the PRC’s GB18030 is such a transformation
format, in a sense. But no-one’s going to, there’s not sufficient benefit
from a locale-specific internal format that supports all of Unicode.

> Do there exist CL implementations which use UTF-8 for strings?

I’ve restored some context you snipped above. I don’t know if any CL
implementations use UTF-8 for strings, but some certainly use Unicode, which
is where D. Herring is coming from there.

Tim Bradshaw

Mar 19, 2011, 5:10:08 PM

On 2011-03-19 19:52:46 +0000, Aidan Kehoe said:
>
> I didn’t write “internal format”, but I understood it as *mostly* in the
> first sense.

I apologise, I think I've misparsed the messages.

What I read was D. Herring saying "... many programs could benefit from
using a locale-specific internal format." and you replying "We’re stuck
with UTF-8." to which I assumed was added an implicit "... as an
internal format", obviously incorrectly.

Obviously many (perhaps all) CLs use Unicode now (and presumably all of
those support UTF-8 as an external format), but I can't imagine how
they would use UTF-8 for strings.
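
The difficulty with UTF-8 as an in-memory string format is that the n-th code point has no fixed byte offset, so indexing means scanning. A Python sketch of that scan, with a hypothetical function name, operating on a plain `bytes` buffer:

```python
def nth_code_point(utf8, n):
    """Return the n-th code point of a UTF-8 buffer by linear scan.

    Bytes of the form 10xxxxxx are continuation bytes; every other byte
    starts a new code point.
    """
    count = -1
    start = None
    for i, byte in enumerate(utf8):
        if byte & 0xC0 != 0x80:          # a lead byte
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return utf8[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return utf8[start:].decode("utf-8")


data = "a\u661f\u00e9".encode("utf-8")  # a, 星, é: 1 + 3 + 2 bytes
print(len(data))                        # 6
print(nth_code_point(data, 1) == "\u661f")  # True
```

Each lookup walks the buffer from the start, which is why an array-like string type with constant-time element access sits awkwardly on top of UTF-8.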

Pascal J. Bourguignon

Mar 20, 2011, 3:21:50 AM

Anticomuna <ts.con...@uol.com.br> writes:

> On Mar 19, 12:35 pm, "Pascal J. Bourguignon" <p...@informatimago.com>
> wrote:
>> Anticomuna <ts.concei...@uol.com.br> writes:
>> > But for others, such as string-upcase, it is just a matter of having a
>> > locale sensitive operation that will correctly replace your ss with ß.
>>
>> Yes, something that is not string-upcase, since string-upcase is
>> specified to work character by character.
>>
>> --
>> __Pascal Bourguignon__ http://www.informatimago.com/
>> A bad day in () is better than a good day in {}.
>
> Well, CL was created in a time where internationalization wasn't much
> of a concern and that's why it just ignores it. It is possible for an
> implementor to change the string-upcase function to handle locales
> appropriately.

Ok, in the case of string-upcase, since it accepts keyword arguments, it
could be extended with a :localized-for parameter.

> The nstring-upcase wouldn't allow it because it just alters an
> existing string. Char-upcase wouldn't allow it because it returns just
> one character at a time.

This is not the reason. The reason it's possible is that
string-upcase is defined to take keyword arguments. Therefore it's
possible for the implementation to define new keyword arguments, and for
the program to call it with :allow-other-keys t :localized-for :tibetan
arguments.

> But my point was that what the OP said has nothing to do with Unicode.
> All that is needed is a proper library.

Agreed, there's no need to change CL:STRING-UPCASE when we can have a
LOCALIZATION:TEXT-UPCASE function at the same time.
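
The shape of such a separate, locale-aware function can be sketched in Python. Everything here is hypothetical: the name `text_upcase`, the `locale` parameter, and the single Turkish/Azerbaijani rule (dotted i upcases to İ) are only meant to show a localized function living next to the default one, not to outline a real localization library.

```python
def text_upcase(text, locale=None):
    """Upcase text, applying a locale-specific rule when one is known.

    Only one rule is implemented: in Turkish ("tr") and Azerbaijani ("az"),
    lowercase dotted i upcases to İ (U+0130) rather than to I.
    """
    if locale in ("tr", "az"):
        text = text.replace("i", "\u0130")
    return text.upper()


# The default behaviour is left untouched ...
print(text_upcase("istanbul"))        # ISTANBUL
# ... while callers that know the language can ask for its rules.
print(text_upcase("istanbul", "tr"))  # İSTANBUL
```

The existing function keeps its specified behaviour, and the localized behaviour is opt-in through an extra argument or a separate name.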