I am slowly adding support for Unicode and external formats to ECL and found several incompatibilities between the Unicode specification and ANSI Common Lisp.
Most of them are related to letter case: CL forces case conversion to be an invertible transformation, and in Unicode it is not. The version of SBCL I have solves this by only using char-upcase/downcase on characters where the transformation is one-to-one.
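To make the one-to-one restriction concrete, here is a minimal sketch in Python. It is not SBCL's actual code, and the helper name `round_trips` is hypothetical. It only checks whether a lowercase character upcases to a single character that downcases back to the original.

```python
def round_trips(ch: str) -> bool:
    """True if upcasing ch yields one character that downcases back to ch."""
    upper = ch.upper()
    return len(upper) == 1 and upper.lower() == ch

# LATIN SMALL LETTER A, SHARP S, LONG S, GREEK SMALL LETTER FINAL SIGMA
for ch in ("a", "\u00df", "\u017f", "\u03c2"):
    print(f"U+{ord(ch):04X} round-trips: {round_trips(ch)}")
```

Sharp s fails because its uppercase form is two characters; long s and final sigma fail because they share an uppercase letter with another lowercase letter.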
The other place is string comparisons or "collation". Unicode demands a normalization process before comparing, so that the placement of different composing characters (accents, marks, etc.) does not affect comparison.
Another place is newline endings, which in Unicode are not just CR or LF, or a combination of both, while ANSI CL demands a single #\Newline character.
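As an illustration of the newline point, the following Python sketch folds several line terminators into a single "\n", roughly what a reader would have to do to present one #\Newline to Lisp code. The function name `to_single_newline` is hypothetical, and the set of terminators handled here (CR, LF, CRLF, NEL, LS, PS) is an assumption about what an implementation would choose to recognize.

```python
import re

# CRLF must be tried first so that it becomes one newline, not two.
# NEL is U+0085, LINE SEPARATOR is U+2028, PARAGRAPH SEPARATOR is U+2029.
_TERMINATORS = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def to_single_newline(text: str) -> str:
    """Replace every recognized line terminator with a single LF."""
    return _TERMINATORS.sub("\n", text)

print(repr(to_single_newline("a\r\nb\u2028c\u0085d\re")))
```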
My question is what level of support the different implementations provide, and whether you consider that the ANSI specification has become obsolete in this respect and should probably be ignored -- perhaps with a configuration flag that selects a Unicode-conformant behavior.
I haven't gotten around to reading it properly myself, but ``Extensions to Common LISP to Support International Character Sets'' may answer your questions, or at least provide some insights into the decisions that were made during standardization.
Seems that many of the issues I mentioned were either explicitly avoided (lexicographic ordering, string comparison) or not considered (not one-to-one mappings). But an interesting text nevertheless.
I think the right question to ask is whether there is sufficient critical mass and consensus to write a CDR or a document that Franz and Lispworks can agree to implement.
On Jan 12, 1:37 am, Marco Antoniotti <marc...@gmail.com> wrote:
> Hi
> I think the right question to ask is whether there is sufficient critical mass and consensus to write a CDR or a document that Franz and Lispworks can agree to implement.
Yes. But where does that leave the "CL community at large"? I am not questioning your choices; I am just griping (as usual) about the fact that what you did may or may not be followed (or anticipated) by the other implementations.
> Yes. But where does that leave the "CL community at large"?
The CL community is composed of three types of people: vendors, users, and complainers. All are mix-and-match. I know you; you have always tended to be a user. And users have a lot of power; they get to vote with their feet (so to speak).
> I am not questioning your choices; I am just griping (as usual) about the fact that what you did may or may not be followed (or anticipated) by the other implementations.
Yes, it sometimes frustrates me to see how slow other vendors are to accept obviously superior concepts :-) although there are many concepts that we've put out there that have made it into some implementations (e.g. simple-streams, fwrappers, extended function names, hierarchical packages, ...). All we can do is try, and it always directly helps our own customers (and the users of our free version).
As for our use of standardized techniques: it seems to me that such attention to what the rest of the industry is doing is one way that provides a higher probability that it will be useful.
du...@franz.com <du...@franz.com> wrote:
> At Franz we chose to follow the consensus of the programming community at large when we chose to go with Unicode characters (though not full-sized - we use 16 bit characters)
As far as I can see, that leaves you with three choices:
* abandon code-points outside the Base Multilingual Plane forever;
* expand your characters and hope that nothing breaks too badly; or
* expose the nightmare of surrogate pairs to programmers because you lied when you said that a CHARACTER could actually hold a whole character.
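For readers who have not met surrogate pairs, this small Python sketch shows the arithmetic that splits a code point outside the Basic Multilingual Plane into two 16-bit code units. The function name `to_surrogate_pair` is made up for the illustration; the constants are the ones defined by UTF-16.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D11E)     # MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")             # D834 DD1E
```

With 16-bit characters, a string holding this one symbol has length 2, which is the exposure the third choice refers to.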
On Jan 13, 3:26 pm, Mark Wooding <m...@distorted.org.uk> wrote:
> du...@franz.com <du...@franz.com> wrote:
> > At Franz we chose to follow the consensus of the programming community at large when we chose to go with Unicode characters (though not full-sized - we use 16 bit characters)
> As far as I can see, that leaves you with three choices:
Why only three?
> * abandon code-points outside the Base Multilingual Plane forever;
So negative. Nothing has to be forever. We make choices all the time, and at the time we made the choice for 16 bits, it seemed the right balance between completeness and space usage. We still offer an 8-bit version, because even only going to 16 bits we caught some flak for it due to the space it uses. Perhaps someday the balance will tip in favor of a few more bits.
> * expand your characters and hope that nothing breaks too badly; or
Nah; no more breakage than between our 8 and 16-bit offerings.
> * expose the nightmare of surrogate pairs to programmers because you lied when you said that a CHARACTER could actually hold a whole character.
We try never to lie.
> Which did you go for?
Ah, yes, and "have you stopped beating your wife yet?"
Posix locales tend to be a pita and very limited, for there can only be one locale per application that has to be activated globally. However, that is not really what I was looking for.
Unicode, even at 16 bits, introduces composing characters, such as accents and other character marks, that make string comparison more complex than just comparing code points, because two code points may be equivalent to a precomposed one, or the order of composing codes may be irrelevant. String upcasing, downcasing and titlecasing differ, but would still be nicely matched to the corresponding lisp functions. String collation is implemented via a different algorithm that involves sorting tables.
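Both effects described above (a precomposed code point versus base plus mark, and marks whose order does not matter) can be seen with Python's standard `unicodedata` module. This is only a sketch of the idea; `canonically_equal` is a hypothetical helper name, and a Lisp implementation would need its own normalization tables.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to normalization form NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"               # e with acute as a single code point
decomposed = "e\u0301"               # e followed by COMBINING ACUTE ACCENT
above_then_below = "q\u0307\u0323"   # dot above, then dot below
below_then_above = "q\u0323\u0307"   # the same two marks in the other order

# Code point comparison says "different"; canonical comparison says "same".
print(precomposed == decomposed, canonically_equal(precomposed, decomposed))
print(above_then_below == below_then_above,
      canonically_equal(above_then_below, below_then_above))
```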
Out of these and some other features, Allegro only seems to focus on the last one, and only marginally, via two special-purpose functions, for the implementation is really based on Posix locales and, I presume, the string comparison functions from the C library.
My question was more along the lines of whether all of the above could be painlessly integrated into the Common Lisp standard in a sensible way, but it seems no implementation has done that so far.
Again, as I said, it may be the case that there is little need for it, but it may also be the case that it is just a matter of some library or person taking the first step -- for the algorithms, as far as I can see, are not that complex.
On Jan 14, 12:26 am, Mark Wooding <m...@distorted.org.uk> wrote:
> du...@franz.com <du...@franz.com> wrote:
> > At Franz we chose to follow the consensus of the programming community at large when we chose to go with Unicode characters (though not full-sized - we use 16 bit characters)
> As far as I can see, that leaves you with three choices:
> * abandon code-points outside the Base Multilingual Plane forever;
> * expand your characters and hope that nothing breaks too badly; or
If an implementation supports 16-bit characters it is quite likely that it will also work with more. According to Unicode any program is free to choose a subset of the character set that it supports, from 8 to 16 to 24 bits or whatever.
An implementation that restricts itself to 16 bits should, however, emit some kind of warning when it reads a surrogate pair, for that is not going to be properly handled by the implementation, but if you look at the character tables, that is very unlikely to happen -- unless you go for musical or mathematical scripts, or dead languages.
A more important point, in my opinion, is that there is no support for string normalization and comparison, and that this is not done transparently by the lisp. For the record, the _same_ character has more than one possible encoding. For instance, A with a ring above can be U+212B, U+00C5 or U+0041+U+030A, and by the Unicode standard the three are canonically equivalent -- string comparison should thus return T.
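The three spellings of that character can be checked with Python's `unicodedata` module. This is a sketch for illustration only; the dictionary labels are informal descriptions, not official character names.

```python
import unicodedata

spellings = {
    "angstrom sign": "\u212b",
    "precomposed A with ring": "\u00c5",
    "A + combining ring": "A\u030a",
}

for label, text in spellings.items():
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(label,
          [f"U+{ord(c):04X}" for c in nfc],
          [f"U+{ord(c):04X}" for c in nfd])

# All three collapse to one composed form and one decomposed form.
nfc_forms = {unicodedata.normalize("NFC", t) for t in spellings.values()}
nfd_forms = {unicodedata.normalize("NFD", t) for t in spellings.values()}
```

Every spelling normalizes to U+00C5 under NFC and to U+0041 U+030A under NFD, so a comparison done after normalization treats them as equal.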
I do not know the level of support for Unicode that other languages provide. Java seems to have the IBM ICU library, which is also available in C/C++, but whether this is used and how often, escapes my knowledge.
On 2009-01-14, Juanjo <juanjose.garciarip...@googlemail.com> wrote:
> If an implementation supports 16-bit characters it is quite likely that it will also work with more. According to Unicode any program is free to choose a subset of the character set that it supports, from 8 to 16 to 24 bits or whatever.
> An implementation that restricts itself to 16 bits should, however, emit some kind of warning when it reads a surrogate pair, for that is not going to be properly handled by the implementation, but if you look at the character tables, that is very unlikely to happen -- unless you go for musical or mathematical scripts, or dead languages.
While I'm not a fan of 16 bit characters, I don't agree with a need for warnings when reading surrogates.
An implementation that uses 16 bit characters must leave surrogates as-is when reading UTF-16 data from a file, while an implementation that uses >= 21 bit characters must assemble them into a code point.
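The difference between the two kinds of reader can be sketched in Python. An implementation with 16-bit characters would store the code units unchanged, while one with 21 or more bits would run something like the hypothetical `assemble_code_points` below. Passing unpaired surrogates through unchanged is an assumption of this sketch, not something the post prescribes.

```python
from typing import Iterable, List

def assemble_code_points(units: Iterable[int]) -> List[int]:
    """Combine UTF-16 code units into code points, joining surrogate pairs."""
    result: List[int] = []
    pending_high = None
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Valid pair: rebuild the supplementary code point.
                result.append(0x10000
                              + ((pending_high - 0xD800) << 10)
                              + (unit - 0xDC00))
                pending_high = None
                continue
            # Assumption: an unpaired high surrogate is kept as-is.
            result.append(pending_high)
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit
        else:
            result.append(unit)
    if pending_high is not None:
        result.append(pending_high)
    return result

units = [0x0041, 0xD834, 0xDD1E, 0x0042]
print([f"U+{cp:04X}" for cp in assemble_code_points(units)])
```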
In neither case is the Unicode handling "improper", so a warning should not be issued. What situation do you have in mind where a warning would be appropriate?
The erroneous situation is the use of actual Lisp characters corresponding to code points for surrogates in an implementation that has >= 21 bit characters. Clozure CL gets it right: CODE-CHAR returns NIL for surrogates. SBCL gets it wrong and returns #\UD800. (Allegro also gets it right: It returns a character, but since it has 16 bit characters, that is correct.)
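The behavior described as correct for wide characters can be modeled in a few lines of Python. The function `code_char` here is a hypothetical analogue written for this illustration, not any implementation's source. Python's own built-in `chr` accepts 0xD800 and returns a lone surrogate, which is the kind of result the paragraph above objects to.

```python
from typing import Optional

def code_char(code: int) -> Optional[str]:
    """Return the character for a code, or None if no such character exists."""
    if not 0 <= code <= 0x10FFFF:
        return None                  # outside the Unicode code space
    if 0xD800 <= code <= 0xDFFF:
        return None                  # surrogates are not characters here
    return chr(code)

print(code_char(0x41), code_char(0xD800))
```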
Of course, 16 bit characters (i.e., use of UTF-16 in Lisp strings) is not an ideal compromise: It combines the disadvantages of UTF-8 with the disadvantages of UTF-32. Like UTF-8, it is a variable-length encoding and doesn't actually provide an equivalence between Unicode code points and Lisp characters. And like full 21 bit characters (i.e., use of UTF-32 in Lisp strings), it is not a space efficient representation.
And variable length encodings like UTF-8 and UTF-16 mean that string-related algorithms have to match and manipulate substrings, not individual characters. But due to combining characters, Unicode-aware applications can't implement string algorithms on a character-by-character basis anyway, so variable length encodings actually introduce no additional complications. Hence, for most programming languages, use of UTF-8 as an internal representation is ideal: It's space efficient and doesn't introduce much run-time overhead.
Unfortunately, Lisp specifies strings to be arrays of characters, and users can reasonably assume that arrays support O(1) access to arbitrary array elements, so UTF-8 is not an attractive internal representation in Lisp.
That leaves UTF-16 and UTF-32 to choose from. While UTF-16 has the flaws mentioned above, the more important consideration to me seems to be compatibility. The free Lisps chose UTF-32, the commercial Lisps chose UTF-16, so you get to pick which set of implementations you want to be compatible with.
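Why UTF-8 conflicts with O(1) indexing can be seen from a sketch of what element access would have to do: scan from the start and count lead bytes. The function name `nth_code_point_utf8` is hypothetical, and the code assumes its input is valid UTF-8.

```python
def nth_code_point_utf8(data: bytes, index: int) -> str:
    """Return the code point at position index by scanning UTF-8 bytes."""
    if index < 0:
        raise IndexError("negative index")
    seen = -1        # number of code points started so far, minus one
    start = None     # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:        # not a continuation byte
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("index out of range")
    return data[start:].decode("utf-8")

data = "a\u00e9\U0001D11Ez".encode("utf-8")   # 1 + 2 + 4 + 1 bytes
print(len(data), nth_code_point_utf8(data, 2) == "\U0001D11E")
```

The cost grows with the index, whereas an array of fixed-width elements can jump straight to the right offset.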
The result is a nightmare for anyone writing Unicode-conforming code. I recently did some work to fix surrogate handling in cxml and related libraries. The resulting #+/-lisp-uses-surrogates conditionalization is not pretty. (Since XML is specified to support all of Unicode, and test suites check for that, my code wouldn't pass important test suites if surrogate handling was missing. So I can't just take the position that surrogates are unimportant in the real world and ignore them.)
On Jan 14, 11:24 am, David Lichteblau <usenet-2...@lichteblau.com> wrote:
> While I'm not a fan of 16 bit characters, I don't agree with a need for warnings when reading surrogates.
> An implementation that uses 16 bit characters must leave surrogates as-is when reading UTF-16 data from a file, while an implementation that uses >= 21 bit characters must assemble them into a code point.
> In neither case is the Unicode handling "improper", so a warning should not be issued. What situation do you have in mind where a warning would be appropriate?
When that text is to be interpreted as a string. You just said that code-char returns nil for a code in the surrogate region, but what is an application to do when it finds a surrogate at a place in a string? The point is that Unicode explicitly permits your application or implementation to restrict itself to a subset of characters and thus treat surrogate pairs as ignorable -- note that I mean the pair, not just the surrogate codepoint. This would lead to a consistent behavior for an implementation that is not able to properly interpret characters outside the BMP and would thus return wrong properties for them.
In other words, you may restrict yourself to 16 bits without using UTF-16 encoding. Whether your implementation issues warnings, uses replacement characters for unsupported surrogate pairs, or simply ignores them is something that should be documented.
This is not the default behavior that I have chosen for ECL, and it may not be what _you_ or your library users need, but I understand that it could be a valid configuration option for people with certain memory requirements or who need wide character strings that are compatible with the requirements of their operating system.
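The documented-policy idea could look something like the following Python sketch, where a reader limited to the BMP applies one of three policies to anything above U+FFFF. The function name `restrict_to_bmp` and the policy names are invented for this example and do not describe ECL or any other implementation.

```python
def restrict_to_bmp(text: str, policy: str = "replace") -> str:
    """Apply a policy to code points that a BMP-only implementation lacks."""
    out = []
    for ch in text:
        if ord(ch) <= 0xFFFF:
            out.append(ch)
        elif policy == "replace":
            out.append("\ufffd")      # REPLACEMENT CHARACTER
        elif policy == "ignore":
            continue                  # drop the character silently
        elif policy == "error":
            raise ValueError(f"U+{ord(ch):X} is outside the BMP")
        else:
            raise ValueError(f"unknown policy: {policy!r}")
    return "".join(out)

print(repr(restrict_to_bmp("a\U0001D11Eb")))
print(repr(restrict_to_bmp("a\U0001D11Eb", policy="ignore")))
```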
On 2009-01-14, Juanjo <juanjose.garciarip...@googlemail.com> wrote:
> When that text is to be interpreted as a string. You just said that code-char returns nil for a code in the surrogate region, but what is an application to do when it finds a surrogate at a place in a string?
There would be no such character (that's why CODE-CHAR returns NIL), so this situation cannot occur.
[...]
> This is not the default behavior that I have chosen for ECL, and it may not be what _you_ or your library users need, but I understand that it could be a valid configuration option for people with certain memory requirements or who need wide character strings that are compatible with the requirements of their operating system.
Okay.
The trouble is that Lisp libraries need to accept strings created by other Lisp code in the same image. Any restrictions on characters not imposed by the implementation are lingering problems that need to be dealt with by the callee.
I'm mostly interested in being able to re-use existing code without having to change it all over the place.
On Jan 14, 2:15 pm, David Lichteblau <usenet-2...@lichteblau.com> wrote:
> On 2009-01-14, Juanjo <juanjose.garciarip...@googlemail.com> wrote:
> > When that text is to be interpreted as a string. You just said that code-char returns nil for a code in the surrogate region, but what is an application to do when it finds a surrogate at a place in a string?
> There would be no such character (that's why CODE-CHAR returns NIL), so this situation cannot occur.
I do not find this answer satisfactory: an implementation reads a sequence of code units (that is, bare (unsigned-byte 16) words) in UTF-16 encoding. It stores strings internally using 16-bit words and it finds a surrogate pair in the input stream: it will be read and stored in a string. Now you say that the surrogate does not exist as a character. How is the surrogate pair stored? Is it simply ignored? Does it store the high and low surrogate code points? Does it store a replacement character?
> I'm mostly interested in being able to re-use existing code without having to change it all over the place.
My wish as well -- that is why this thread was started.
ECL uses UTF-32 internally, so there are no surrogates to handle. Babel will become unnecessary at least in ECL, due to the support of external formats and (soon) stream on sequence types.
Now, as for CL-UNICODE, it currently includes no support for strings -- it only works at the character level, no notion of UTF-8, -16, etc, and it works only with characters and code-points assuming the implementation supports all of Unicode.
My idea, and what I have posted to the cl-unicode mailing list (no answer so far), would be that the algorithms for dealing with Unicode strings should be part of cl-unicode. This library should provide drop-in replacements for things like character and string comparisons, plus additional functions for normalization, handling locales, etc.
* guideline for Lisp implementors
* a list of the various problems supporting Unicode in CL
* possible problems with the integration of Unicode in ANSI CL
* reasoning behind design decisions
On 2009-01-14, Juanjo <juanjose.garciarip...@googlemail.com> wrote:
> I do not find this answer satisfactory: an implementation reads a sequence of code units (that is, bare (unsigned-byte 16) words) in UTF-16 encoding. It stores strings internally using 16-bit words and it finds a surrogate pair in the input stream: it will be read and stored in a string. Now you say that the surrogate does not exist as a character.
No, I agree with you. If the implementation uses UTF-16 internally, surrogates are characters.
> ECL uses UTF-32 internally, so there are no surrogates to handle. Babel will become unnecessary at least in ECL, due to the support of external formats and (soon) stream on sequence types.
That's great. (I expect portable code to use libraries like Babel for various use cases anyway, but native external format code is important to have.)
> My idea, and what I have posted to the cl-unicode mailing list (no answer so far), would be that the algorithms for dealing with Unicode strings should be part of cl-unicode. This library should provide drop-in replacements for things like character and string comparisons, plus additional functions for normalization, handling locales, etc.
> Posix locales tend to be a pita and very limited, for there can only be one locale per application that has to be activated globally. However, that is not really what I was looking for.
To be clear: we got the locale definitions from Posix (actually, from IBM), but we're not actually using Posix locales. We have our own locale object that is dynamically rebindable. Thus, we can have multiple locales within a single lisp application.
btw, emacs lisp has the best unicode support out of the box of all the programming languages I'm an expert in (perl, python, php, java). (I haven't used unicode with Mathematica much, but its ability to handle arbitrary chars and math symbols was built in some 10 or 20 years before unicode became popular.)
Here's a very ROUGH ordered list of langs in terms of quality of unicode support, from my experience:
emacs lisp, java, javascript, php, python, perl. (Common Lisp, Haskell, Scheme, would be after python or perl)
David Lichteblau <usenet-2...@lichteblau.com> writes:
> If the implementation uses UTF-16 internally, surrogates are characters.
But a surrogate isn't a character. It's half a character, an encoding artifact with no more right to be called a character than a byte of a multibyte UTF-8 encoding. This is what I meant when I called having CHARACTER store a UTF-16 encoding unit a lie. What's CHAR-UPCASE of a surrogate?
This is the mess that Java got itself into by adopting Unicode early and overspecifying its `char' type. (That C# followed Java in this regard despite having the benefit of hindsight is criminal. But I digress.)
On 2009-01-14, Mark Wooding <m...@distorted.org.uk> wrote:
> But a surrogate isn't a character. It's half a character, an encoding artifact with no more right to be called a character than a byte of a multibyte UTF-8 encoding. This is what I meant when I called having CHARACTER store a UTF-16 encoding unit a lie. What's CHAR-UPCASE of a surrogate?
Right, but CHAR-UPCASE isn't well defined even if the implementation has 21 bit characters and disallows characters corresponding to surrogates, because combining characters need to be taken into account, so Unicode-aware code needs to upcase or downcase strings, not characters.
Unfortunately, STRING-UPCASE and STRING-DOWNCASE are specified to work on a character-by-character basis, so they can't be used either. An implementation (or a portable library) could offer alternative functions STRING-UPCASE and STRING-DOWNCASE in a separate package which can return results of a different length.
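The difference between a character-by-character upcase and one that may change the string length can be shown with a short Python sketch. `upcase_charwise` is a hypothetical stand-in for a length-preserving STRING-UPCASE; how a given Lisp actually treats sharp s is implementation-specific, so leaving it untouched is an assumption of this example.

```python
def upcase_charwise(text: str) -> str:
    """Upcase each character alone, keeping those with no single-char form."""
    return "".join(ch.upper() if len(ch.upper()) == 1 else ch for ch in text)

word = "stra\u00dfe"               # German "strasse" spelled with sharp s
charwise = upcase_charwise(word)   # same length as the input
full = word.upper()                # Python applies the full case mapping

print(charwise, len(charwise))
print(full, len(full))
```

The full mapping turns six characters into seven, which is why such a function cannot keep the contract of returning a string of the same length.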
UTF-16 might be silly anyway, but doesn't make things worse in this case.