Unicode and Common Lisp

Juanjo

Jan 11, 2009, 7:17:26 AM

Hi,

I am slowly adding support for Unicode and external formats to ECL and
found several incompatibilities between the Unicode specification and
ANSI Common Lisp.

Most of them are related to letter case: CL forces case conversion to
be an invertible transformation, and in Unicode it is not. The version
of SBCL I have solves this by only using char-upcase/downcase on
characters where the transformation is one-to-one.
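To make the invertibility problem concrete, here is a minimal sketch in Python, used only because its built-in string methods apply the full Unicode case mappings. The helper names `upcase_is_invertible` and `cautious_upcase` are made up for illustration, and the rule they implement is just one reading of "only where the transformation is one-to-one"; it is not taken from SBCL's source.

```python
def upcase_is_invertible(ch: str) -> bool:
    """True if upcasing ch loses nothing: either ch is unchanged, or the
    result is a single character that downcases back to ch."""
    up = ch.upper()
    if up == ch:  # already upper case, or caseless
        return True
    return len(up) == 1 and up.lower() == ch


def cautious_upcase(ch: str) -> str:
    """Upcase only when the mapping is one-to-one; otherwise return ch as is."""
    return ch.upper() if upcase_is_invertible(ch) else ch


# U+00DF (sharp s) upcases to two characters; U+0131 (dotless i) and
# U+017F (long s) upcase to letters that downcase to something else.
for ch in ["a", "\u00df", "\u0131", "\u017f"]:
    print(f"U+{ord(ch):04X} upper={ch.upper()!r} "
          f"invertible={upcase_is_invertible(ch)}")
```

A character-level function in this style never changes the length of a string, which is what the ANSI functions require, at the price of leaving some characters uncased.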

Another place is string comparison, or "collation". Unicode demands
a normalization process before comparing, so that the placement of
different combining characters (accents, marks, etc.) does not affect
comparison.

A third place is line endings, which in Unicode are not just CR or
LF, or a combination of both, while ANSI CL demands a single #\Newline
character.
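As a rough illustration of what an external format would have to do on input, the sketch below (Python, not from the original post) folds the line terminators named in the Unicode newline guidelines into a single newline. The set used here (CR, LF, CRLF, VT, FF, NEL, LS, PS) is my reading of those guidelines, and `normalize_newlines` is a hypothetical name.

```python
import re

# CRLF is listed first so that it collapses to one newline, not two.
_LINE_BREAKS = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def normalize_newlines(text: str) -> str:
    """Replace every recognized line terminator with a single LF."""
    return _LINE_BREAKS.sub("\n", text)


sample = "one\r\ntwo\rthree\u0085four\u2028five\u2029six\n"
print(normalize_newlines(sample).split("\n"))
```

The reverse direction (writing #\Newline back out) is lossy: once folded, the original terminator cannot be recovered unless the stream remembers it.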

My question is what level of support the different implementations
provide, and whether you consider that the ANSI specification has
become obsolete in this respect and should probably be ignored --
perhaps with a configuration flag that selects a Unicode-conformant
behavior.

Juanjo

Tobias C. Rittweiler

Jan 11, 2009, 7:35:40 AM

Juanjo <juanjose.g...@googlemail.com> writes:

I haven't gotten around to reading it properly myself, but

``Extensions to Common LISP to Support International Character Sets''

(http://common-lisp.net/~trittweiler/x3j13-char-proposal.pdf)

may answer your questions, or at least provide some insights to the
decisions that were made during standardization.

-T.

Juanjo

Jan 11, 2009, 8:46:03 AM

On Jan 11, 1:35 pm, "Tobias C. Rittweiler" <t...@freebits.de.invalid>
wrote:

> I haven't come around reading it properly myself, but
>
> ``Extensions to Common LISP to Support International Character Sets''
>
> (http://common-lisp.net/~trittweiler/x3j13-char-proposal.pdf)
>
> may answer your questions, or at least provide some insights to the
> decisions that were made during standardization.

It seems that many of the issues I mentioned were either explicitly
avoided (lexicographic ordering, string comparison) or not considered
(mappings that are not one-to-one). But an interesting text
nevertheless.

Juanjo

Marco Antoniotti

Jan 12, 2009, 4:37:19 AM

Hi

I think the right question to ask is whether there is sufficient
critical mass and consensus to write a CDR or a document that Franz
and Lispworks can agree to implement.

Cheers
--
Marco
www.european-lisp-symposium.org

On Jan 11, 3:46 pm, Juanjo <juanjose.garciarip...@googlemail.com>
wrote:

du...@franz.com

Jan 12, 2009, 11:19:30 AM

On Jan 12, 1:37 am, Marco Antoniotti <marc...@gmail.com> wrote:
> Hi
>
> I think the right question to ask is whether there is sufficient
> critical mass and consensus to write a CDR or a document that Franz
> and Lispworks can agree to implement.

At Franz we chose to follow the consensus of the programming community
at large when we chose to go with Unicode characters (though not
full-sized - we use 16-bit characters), and Posix Locales. See
http://www.franz.com/support/documentation/8.1/doc/iacl.htm which is
entirely devoted to international character sets, and especially the
section on locales: http://www.franz.com/support/documentation/8.1/doc/iacl.htm#locales-1

Duane

Marco Antoniotti

Jan 13, 2009, 6:23:56 AM

On Jan 12, 5:19 pm, du...@franz.com wrote:
> On Jan 12, 1:37 am, Marco Antoniotti <marc...@gmail.com> wrote:
>
> > Hi
>
> > I think the right question to ask is whether there is sufficient
> > critical mass and consensus to write a CDR or a document that Franz
> > and Lispworks can agree to implement.
>
> At Franz we chose to follow the consensus of the programming community
> at large when we chose to go with Unicode characters (though not
> full-sized - we use 16 bit characters), and Posix Locales. See
> http://www.franz.com/support/documentation/8.1/doc/iacl.htm which is
> entirely devoted to international character sets, and especially the
> section on locales: http://www.franz.com/support/documentation/8.1/doc/iacl.htm#locales-1
>

Yes. But where does that leave the "CL community at large"? I am not
questioning your choices; I am just griping (as usual) about the fact
that what you did may or may not be followed (or anticipated) by the
other implementations.

Cheers
--
Marco
www.european-lisp-symposium.org

du...@franz.com

Jan 13, 2009, 12:40:12 PM

On Jan 13, 3:23 am, Marco Antoniotti <marc...@gmail.com> wrote:
> On Jan 12, 5:19 pm, du...@franz.com wrote:
>
> > On Jan 12, 1:37 am, Marco Antoniotti <marc...@gmail.com> wrote:
>
> > > Hi
>
> > > I think the right question to ask is whether there is sufficient
> > > critical mass and consensus to write a CDR or a document that Franz
> > > and Lispworks can agree to implement.
>
> > At Franz we chose to follow the consensus of the programming community
> > at large when we chose to go with Unicode characters (though not
> > full-sized - we use 16 bit characters), and Posix Locales. See
> > http://www.franz.com/support/documentation/8.1/doc/iacl.htm which is
> > entirely devoted to international character sets, and especially the
> > section on locales: http://www.franz.com/support/documentation/8.1/doc/iacl.htm#locales-1
>
> Yes. But where does that leave the "CL community at large"?

The CL community is composed of three types of people: vendors, users,
and complainers. All are mix-and-match. I know you; you have always
tended to be a user. And users have a lot of power; they get to vote
with their feet (so to speak).

> I am not
> questioning your choices; I am just griping (as usual) about the fact
> that what you did may or may not be followed (or anticipated) by the
> other implementations.

Yes, it sometimes frustrates me to see how slow other vendors are to
accept obviously superior concepts :-) although there are many
concepts that we've put out there that have made it into some
implementations (e.g. simple-streams, fwrappers, extended function
names, hierarchical packages, ...). All we can do is try, and it
always directly helps our own customers (and the users of our free
version).

As for our use of standardized techniques: it seems to me that paying
attention to what the rest of the industry is doing is one way to
raise the probability that our work will be useful.

Duane

Mark Wooding

Jan 13, 2009, 6:26:20 PM

du...@franz.com <du...@franz.com> wrote:

> At Franz we chose to follow the consensus of the programming community
> at large when we chose to go with Unicode characters (though not full-
> sized - we use 16 bit characters)

As far as I can see, that leaves you with three choices:

* abandon code-points outside the Base Multilingual Plane forever;

* expand your characters and hope that nothing breaks too badly; or

* expose the nightmare of surrogate pairs to programmers because you
lied when you said that a CHARACTER could actually hold a whole
character.

Which did you go for?

-- [mdw]

du...@franz.com

Jan 14, 2009, 3:24:24 AM

On Jan 13, 3:26 pm, Mark Wooding <m...@distorted.org.uk> wrote:
> du...@franz.com <du...@franz.com> wrote:
> > At Franz we chose to follow the consensus of the programming community
> > at large when we chose to go with Unicode characters (though not full-
> > sized - we use 16 bit characters)
>
> As far as I can see, that leaves you with three choices:

Why only three?

> * abandon code-points outside the Base Multilingual Plane forever;

So negative. Nothing has to be forever. We make choices all the time,
and at the time we made the choice for 16 bits, it seemed the right
balance between completeness and space usage. We still offer an 8-bit
version, because even only going to 16 bits we caught some flak for it
due to the space it uses. Perhaps someday the balance will tip in
favor of a few more bits.

> * expand your characters and hope that nothing breaks too badly; or

Nah; no more breakage than between our 8 and 16-bit offerings.

> * expose the nightmare of surrogate pairs to programmers because you
> lied when you said that a CHARACTER could actually hold a whole
> character.

We try never to lie.

> Which did you go for?

Ah, yes, and "have you stopped beating your wife yet?"

Duane

Juanjo

Jan 14, 2009, 4:10:46 AM

On Jan 12, 5:19 pm, du...@franz.com wrote:

> At Franz we chose to follow the consensus of the programming community
> at large when we chose to go with Unicode characters (though not
> full-sized - we use 16 bit characters), and Posix Locales. See
> http://www.franz.com/support/documentation/8.1/doc/iacl.htm which is
> entirely devoted to international character sets, and especially the
> section on locales: http://www.franz.com/support/documentation/8.1/doc/iacl.htm#locales-1

Posix locales tend to be a pita and very limited, for there can only
be one locale per application that has to be activated globally.
However, that is not really what I was looking for.

Unicode, even at 16 bits, introduces composing characters, such as
accents and other character marks, that make string comparison more
complex than just comparing code points, because two code points may
be equivalent to a precomposed one, or the order of composing codes
may be irrelevant. String upcasing, downcasing and titlecasing differ,
but would still be nicely matched to the corresponding lisp functions.
String collation is implemented via a different algorithm that
involves sorting tables.
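Two of the points above can be shown in a few lines of Python (a sketch, not from the original post). The first part shows that the order of combining marks stops mattering once both strings are normalized. The second part is a deliberately crude sort key, `crude_sort_key`, a hypothetical helper that strips combining marks and case; it is NOT the Unicode Collation Algorithm, which works from weight tables, and only shows why raw code-point order is not enough.

```python
import unicodedata

# Dot above (U+0307) and dot below (U+0323) attach to different sides of
# the base letter, so their relative order is not significant.
a = "q\u0307\u0323"
b = "q\u0323\u0307"
same_after_nfd = (unicodedata.normalize("NFD", a)
                  == unicodedata.normalize("NFD", b))
print(a == b, same_after_nfd)


def crude_sort_key(s: str) -> str:
    """Very rough stand-in for a collation key: decompose, drop combining
    marks, fold case. Real collation needs language-specific tables."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


words = ["zebra", "\u00c4pfel", "apple"]
print(sorted(words))                      # plain code-point order
print(sorted(words, key=crude_sort_key))  # accent- and case-insensitive
```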

Out of these and some other features, Allegro only seems to focus on
the last one, and only marginally, via two special-purpose functions,
for the implementation is really based on Posix locales and, I presume,
the string comparison functions from the C library.

My question was more along the lines of whether all of the above could
be painlessly integrated into the Common Lisp standard in a sensible
way, but it seems no implementation has done that so far.

Again, as I said, it may be the case that there is little need for it,
but it may also be the case that it is just a matter of some library
or person taking the first step -- for the algorithms, as far as I can
see, are not that complex.

Juanjo

Jan 14, 2009, 4:19:51 AM

On Jan 14, 12:26 am, Mark Wooding <m...@distorted.org.uk> wrote:
> du...@franz.com <du...@franz.com> wrote:
> > At Franz we chose to follow the consensus of the programming community
> > at large when we chose to go with Unicode characters (though not full-
> > sized - we use 16 bit characters)
>
> As far as I can see, that leaves you with three choices:
>
> * abandon code-points outside the Base Multilingual Plane forever;
>
> * expand your characters and hope that nothing breaks too badly; or

If an implementation supports 16-bit characters it is quite likely
that it will also work with more. According to Unicode, any program is
free to choose the subset of the character set that it supports, from 8
to 16 to 24 bits or whatever.

An implementation that restricts itself to 16 bits should, however,
emit some kind of warning when it reads a surrogate pair, for that is
not going to be properly handled by the implementation, but if you
look at the character tables, that is very unlikely to happen --
unless you go for musical or mathematical scripts, or dead languages.

A more important point, in my opinion, is that there is no support for
string normalization and comparison, and that this is not done
transparently by the lisp. For the record, the _same_ character has
more than one possible encoding. For instance, A with a circle above
can be U+212B, U+00C5 or U+0041 U+030A, and by the Unicode standard the
three are canonically equivalent -- string comparison should thus
return T.
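The three encodings mentioned above can be checked with Python's standard `unicodedata` module (a sketch, not from the original post; `canonically_equal` is a name made up here). Comparing normalized forms is one simple way to test canonical equivalence.

```python
import unicodedata

angstrom_sign = "\u212b"   # ANGSTROM SIGN
precomposed = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"     # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def canonically_equal(x: str, y: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)


forms = [angstrom_sign, precomposed, decomposed]
print([x == forms[0] for x in forms])                    # raw comparison
print([canonically_equal(x, forms[0]) for x in forms])   # normalized
```

A Lisp that wanted STRING= to return T here would have to normalize either on input or at comparison time; both choices have costs that the ANSI text does not anticipate.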

I do not know the level of support for Unicode that other languages
provide. Java seems to have the IBM ICU library, which is also
available in C/C++, but whether this is used and how often, escapes my
knowledge.

Juanjo

David Lichteblau

Jan 14, 2009, 5:24:33 AM

On 2009-01-14, Juanjo <juanjose.g...@googlemail.com> wrote:
> If an implementation supports 16 bits characters it is quite likely
> that it will also work with more. According to Unicode any program is
> free to choose a subset of the character set that it supports, from 8
> to 16 to 24 bits or whatever.
>
> An implementation that restricts itself to 16 bits should, however,
> emit some kind of warning when it reads a surrogate pair, for that is
> not going to be properly handled by the implementation, but if you
> look at the character tables, that is very unlikely to happen --
> unless you go for musical or mathematical scripts, or dead languages.

While I'm not a fan of 16 bit characters, I don't agree with a need for
warnings when reading surrogates.

An implementation that uses 16 bit characters must leave surrogates
as-is when reading UTF-16 data from a file, while an implementation that
uses >= 21 bit characters must assemble them into a code point.
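For reference, the arithmetic behind "assembling surrogates into a code point" is small. The sketch below is in Python and is not from the original post; the function names are made up for illustration. It follows the UTF-16 definition: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above the BMP into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Assemble a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, one of the "musical scripts"
high, low = to_surrogates(clef)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
```

An implementation with 16-bit characters stores `high` and `low` as two separate string elements; one with 21-bit or wider characters stores the single value `clef`.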

In neither case is the Unicode handling "improper", so a warning should
not be issued. What situation do you have in mind where a warning would
be appropriate?

The erroneous situation is the use of actual Lisp characters
corresponding to code points for surrogates in an implementation that
has >= 21 bit characters. Clozure CL gets it right: CODE-CHAR returns
NIL for surrogates. SBCL gets it wrong and returns #\UD800. (Allegro
also gets it right: It returns a character, but since it has 16 bit
characters, that is correct.)
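The behavior described for CODE-CHAR can be mimicked as follows. This is a Python sketch, not from the original post: `code_char` is a hypothetical analogue of the Lisp function for an implementation with characters of 21 bits or more, and says nothing about how any Lisp implements it. Python's own `chr` is permissive in the same way the post attributes to SBCL: it hands back a lone surrogate, which then cannot be encoded.

```python
def code_char(code: int):
    """Return the character for code, or None when no character exists
    (out of range, or in the surrogate block U+D800..U+DFFF)."""
    if not 0 <= code <= 0x10FFFF:
        return None
    if 0xD800 <= code <= 0xDFFF:
        return None
    return chr(code)


print(code_char(0x41), code_char(0xD800))

# Python's chr accepts the surrogate, but the result is not valid text:
lone = chr(0xD800)
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)
```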

Of course, 16 bit characters (i.e., use of UTF-16 in Lisp strings) is
not an ideal compromise: It combines the disadvantages of UTF-8 with the
disadvantages of UTF-32. Like UTF-8, it is a variable-length encoding
and doesn't actually provide an equivalence between Unicode code points
and Lisp characters. And like full 21 bit characters (i.e., use of
UTF-32 in Lisp strings), it is not a space efficient representation.

And variable length encodings like UTF-8 and UTF-16 mean that
string-related algorithms have to match and manipulate substrings, not
individual characters. But due to combining characters, Unicode-aware
applications can't implement string algorithms on a
character-by-character basis anyway, so variable length encodings
actually introduce no additional complications. Hence, for most
programming languages, use of UTF-8 as an internal representation is
ideal: It's space efficient and doesn't introduce much run-time
overhead.

Unfortunately, Lisp specifies strings to be arrays of characters, and
users can reasonably assume that arrays support O(1) access to arbitrary
array elements, so UTF-8 is not an attractive internal representation
in Lisp.
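The space and indexing trade-off can be made concrete with a short Python sketch (not from the original post; the sample string and the helper `nth_code_point_utf8` are made up for illustration). It measures the same four code points in the three encodings, then shows that finding the n-th code point in UTF-8 requires scanning the bytes rather than a single array access.

```python
# a, e-acute, euro sign, musical G clef (the last one is outside the BMP)
text = "a\u00e9\u20ac\U0001d11e"

sizes = {enc: len(text.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)


def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 data by scanning for lead bytes.
    Continuation bytes have the bit pattern 10xxxxxx, so every other byte
    starts a new code point. The scan is linear in the length of data."""
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    begin = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")


encoded = text.encode("utf-8")
print(nth_code_point_utf8(encoded, 2) == "\u20ac")
```

With UTF-32 the same lookup is plain indexing at offset 4*n, which is what AREF and CHAR are expected to cost.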

That leaves UTF-16 and UTF-32 to choose from. While UTF-16 has the
flaws mentioned above, the more important consideration seems to me to
be compatibility. The free Lisps chose UTF-32, the commercial Lisps
chose UTF-16, so you get to pick which set of implementations you want
to be compatible with.

The result is a nightmare for anyone writing Unicode-conforming code. I
recently did some work to fix surrogate handling in cxml and related
libraries. The resulting #+/-lisp-uses-surrogates conditionalization is
not pretty. (Since XML is specified to support all of Unicode, and test
suites check for that, my code wouldn't pass important test suites if
surrogate handling was missing. So I can't just take the position that
surrogates are unimportant in the real world and ignore them.)

d.

Juanjo

Jan 14, 2009, 6:24:28 AM

On Jan 14, 11:24 am, David Lichteblau <usenet-2...@lichteblau.com>
wrote:

> While I'm not a fan of 16 bit characters, I don't agree with a need for
> warnings when reading surrogates.
>
> An implementation that uses 16 bit characters must leave surrogates
> as-is when reading UTF-16 data from a file, while an implementation that
> uses >= 21 bit characters must assemble them into a code point.
>
> In neither case is the Unicode handling "improper", so a warning should
> not be issued. What situation do you have in mind where a warning would
> be appropriate?

When that text is to be interpreted as a string. You just said that
code-char returns nil for a code in the surrogate region, but what is
an application to do when it finds a surrogate at a place in a string?
The point is that Unicode explicitly permits your application or
implementation to restrict itself to a subset of characters and thus
treat surrogate pairs as ignorable -- note that I mean the pair, not
just the surrogate code point. This would lead to consistent behavior
for an implementation that is not able to properly interpret
characters outside the BMP and would thus return wrong properties for
them.

In other words, you may restrict yourself to 16 bits without using
UTF-16 encoding. Whether your implementation issues warnings or uses
replacement characters for unsupported surrogate pairs, or simply
ignores them, it is something that should be documented.

This is not the default behavior that I have chosen for ECL, and it
may not be what _you_ or your library users need, but I understand
that it could be a valid configuration option for people with certain
memory requirements or who need wide character strings that are
compatible with the requirements of their operating system.

David Lichteblau

Jan 14, 2009, 8:15:24 AM

On 2009-01-14, Juanjo <juanjose.g...@googlemail.com> wrote:

> When that text is to be interpreted as a string. You just said that
> code-char returns nil for a code in the surrogate region, but what is
> an application to do when it finds a surrogate at a place in a string?

There would be no such character (that's why CODE-CHAR returns NIL), so
this situation cannot occur.

[...]

> This is not the default behavior that I have chosen for ECL, and it
> may not be what _you_ or your library users need, but I understand
> that it could be a valid configuration option for people with certain
> memory requirements or which need wide character strings that are
> compatible with the requirements of its operating system.

Okay.

The trouble is that Lisp libraries need to accept strings created by
other Lisp code in the same image. Any restrictions on characters that
are not imposed by the implementation are lingering problems that need
to be dealt with by the callee.

I'm mostly interested in being able to re-use existing code without
having to change it all over the place.

Will Babel (http://common-lisp.net/project/babel/) handle strings with
surrogates in them correctly on ECL? Will CL-UNICODE
(http://www.weitz.de/cl-unicode/) handle strings with surrogates in them
correctly on ECL?

d.

Juanjo

Jan 14, 2009, 8:39:24 AM

On Jan 14, 2:15 pm, David Lichteblau <usenet-2...@lichteblau.com>
wrote:

> On 2009-01-14, Juanjo <juanjose.garciarip...@googlemail.com> wrote:
>
> > When that text is to be interpreted as a string. You just said that
> > code-char returns nil for a code in the surrogate region, but what is
> > an application to do when it finds a surrogate at a place in a string?
>
> There would be no such character (that's why CODE-CHAR returns NIL), so
> this situation cannot occur.

I do not find this answer satisfactory: an implementation reads a
sequence of code units (that is bare (unsigned-byte 16) words) in
UTF-16 encoding. It stores internally strings using 16-bit words and
it finds a surrogate pair in the input stream: it will be read and
stored in a string. Now you say that the surrogate does not exist as
character. How is the surrogate pair stored? Is it simply ignored?
Does it store the high and low surrogate code points? Does it store a
replacement character?

> I'm mostly interested in being able to re-use existing code without
> having to change it all over the place.

My wish as well -- that is why this thread was started.

> Will Babel (http://common-lisp.net/project/babel/) handle strings with
> surrogates in them correctly on ECL? Will CL-UNICODE
> (http://www.weitz.de/cl-unicode/) handle strings with surrogates in them
> correctly on ECL?

ECL uses UTF-32 internally, so there are no surrogates to handle.
Babel will become unnecessary, at least in ECL, due to the support of
external formats and (soon) streams on sequence types.

Now, as for CL-UNICODE, it currently includes no support for strings
-- it only works at the character level, no notion of UTF-8, -16, etc,
and it works only with characters and code-points assuming the
implementation supports all of Unicode.

My idea, and what I have posted to the cl-unicode mailing list (no
answer so far), would be that the algorithms for dealing with Unicode
strings should be part of cl-unicode. This library should provide
drop-in replacements for things like character and string comparisons,
plus additional functions for normalization, handling locales, etc.

Juanjo

jos...@corporate-world.lisp.de

Jan 14, 2009, 8:57:52 AM

On 14 Jan., 14:39, Juanjo <juanjose.garciarip...@googlemail.com>
wrote:

It might be useful to add information to:

http://www.cliki.net/Unicode%20support

http://www.cliki.net/Unicode%20and%20Lisp

It would be useful to have some

* guidelines for Lisp implementors
* a list of the various problems supporting Unicode in CL
* possible problems with the integration of Unicode in ANSI CL
* reasoning behind design decisions

David Lichteblau

Jan 14, 2009, 9:34:00 AM

On 2009-01-14, Juanjo <juanjose.g...@googlemail.com> wrote:

> I do not find this answer satisfactory: an implementation reads a
> sequence of code units (that is bare (unsigned-byte 16) words) in
> UTF-16 encoding. It stores internally strings using 16-bit words and
> it finds a surrogate pair in the input stream: it will be read and
> stored in a string. Now you say that the surrogate does not exist as
> character.

No, I agree with you. If the implementation uses UTF-16 internally,
surrogates are characters.

> ECL uses internally UTF-32, so there are no surrogates to handle.
> Babel will become unnecessary at least in ECL, due to the support of
> external formats and (soon) stream on sequence types.

That's great. (I expect portable code to use libraries like Babel for
various use cases anyway, but native external format code is
important to have.)

> My idea, and what I have posted to the cl-unicode mailing list (no
> answer so far), would be that the algorithms for dealing with Unicode
> strings should be part of cl-unicode. This library should provide
> drop-in replacements for things like character and string comparisons,
> plus additional functions for normalization, handling locales, etc.

That would be nice.

du...@franz.com

Jan 14, 2009, 1:27:07 PM

On Jan 14, 1:10 am, Juanjo <juanjose.garciarip...@googlemail.com>
wrote:

> On Jan 12, 5:19 pm, du...@franz.com wrote:
>
> > At Franz we chose to follow the consensus of the programming community
> > at large when we chose to go with Unicode characters (though not
> > full-sized - we use 16 bit characters), and Posix Locales. See
> > http://www.franz.com/support/documentation/8.1/doc/iacl.htm which is
> > entirely devoted to international character sets, and especially the
> > section on locales: http://www.franz.com/support/documentation/8.1/doc/iacl.htm#locales-1
>
> Posix locales tend to be a pita and very limited, for there can only
> be one locale per application that has to be activated globally.
> However, that is not really what I was looking for.

To be clear; we got the locale definitions from Posix (actually, from
IBM), but we're not actually using Posix locales. We have our own
locale object that is dynamically rebindable. Thus, we can have
multiple locales within a single lisp application.

Duane

Xah Lee

Jan 14, 2009, 2:18:34 PM

On Jan 11, 4:17 am, Juanjo <juanjose.garciarip...@googlemail.com>
wrote:

btw, emacs lisp has the best unicode support out of the box of all the
programming langs i'm an expert in (perl, python, php, java). (haven't
used unicode with Mathematica much, but its abilities for arbitrary
chars and math symbols were built in some 10 or 20 years before
unicode became popular.)

Here's a very ROUGH ordered list of langs in terms of quality of
unicode support, from my experience:

emacs lisp, java, javascript, php, python, perl. (Common Lisp,
Haskell, Scheme, would be after python or perl)

Xah
∑ http://xahlee.org/

☄

Mark Wooding

Jan 14, 2009, 4:22:22 PM

David Lichteblau <usene...@lichteblau.com> writes:

> If the implementation uses UTF-16 internally, surrogates are
> characters.

But a surrogate isn't a character. It's half a character, an encoding
artifact with no more right to be called a character than a byte of a
multibyte UTF-8 encoding. This is what I meant when I called having
CHARACTER store a UTF-16 encoding unit a lie. What's CHAR-UPCASE of a
surrogate?

This is the mess that Java got itself into by adopting Unicode early and
overspecifying its `char' type. (That C# followed Java in this regard
despite having the benefit of hindsight is criminal. But I digress.)

-- [mdw]

David Lichteblau

Jan 15, 2009, 8:42:50 AM

On 2009-01-14, Mark Wooding <m...@distorted.org.uk> wrote:
> But a surrogate isn't a character. It's half a character, an encoding
> artifact with no more right to be called a character than a byte of a
> multibyte UTF-8 encoding. This is what I meant when I called having
> CHARACTER store a UTF-16 encoding unit a lie. What's CHAR-UPCASE of a
> surrogate?

Right, but CHAR-UPCASE isn't well defined even if the implementation has
21 bit characters and disallows characters corresponding to surrogates,
because combining characters need to be taken into account, so
Unicode-aware code needs to upcase or downcase strings, not characters.

Unfortunately, STRING-UPCASE and STRING-DOWNCASE are specified to work
on a character-by-character basis, so they can't be used either. An
implementation (or a portable library) could offer alternative functions
STRING-UPCASE and STRING-DOWNCASE in a separate package which can return
results of a different length.
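The length change is easy to observe in a language whose string upcasing applies the full Unicode mappings. The Python sketch below is not from the original post, and `length_preserving_upcase` is a hypothetical stand-in for the character-by-character behavior that ANSI STRING-UPCASE is specified to have.

```python
def length_preserving_upcase(s: str) -> str:
    """Upcase character by character, leaving alone any character whose
    upper-case form is longer than one character."""
    return "".join(c.upper() if len(c.upper()) == 1 else c for c in s)


for word in ["stra\u00dfe", "\ufb01n"]:   # sharp s; the "fi" ligature
    full = word.upper()                    # full mapping, may grow
    fixed = length_preserving_upcase(word)
    print(repr(word), len(word), repr(full), len(full), repr(fixed), len(fixed))
```

A separate-package STRING-UPCASE of the kind suggested above would behave like the `full` column; the standard function can only ever behave like the `fixed` one.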

UTF-16 might be silly anyway, but doesn't make things worse in this case.

d.