This came up in qilang. I thought the fastest way to get some light on this would be to throw it up for general discussion here in Lisp. Is there a weakness in CL here?
Note the initial part of the post quoted is talking about Javascript/Python.
QUOTE "\u661F" (星) is a string.
The same type as "\u65E5\u751F" (日生).
'\u661F' is the same as "\u661F" except for ' vs " quoting.
Javascript and python haven't quite escaped the conceptual trap of characters, since "\u661f" is a string of length 1, and "\u65e5\u751f" is a string of length 2, but by having their 'characters' of the same class as their strings the problem is reduced.
You can think of them as having strings composed of substrings, with a fixed composition mechanism.
Ideally the composition mechanism would be variable, but even this compromise is a big step up from characters vs. strings in the CL sense.
Any CL system dealing with characters as unicode code-points will break when exposed to anything requiring combining character sequences -- e.g., Tibetan, Thai, lots of Indian scripts, etc.
With javascript, python, etc, in most cases you should at least be able to pass through sequences of characters where the programmer anticipated a single character without problems.
Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis. UNQUOTE
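For anyone who wants to check the quoted claims about Python, here is a minimal sketch. It assumes Python 3, and the variable names are mine, not the quoted poster's.

```python
# Python 3: indexing a string yields another string, not a separate character type.
star = "\u661F"              # 星
sun_birth = "\u65E5\u751F"   # 日生

first = sun_birth[0]
print(type(first) is str)          # True: a "character" is just a length-1 str
print(len(star), len(sun_birth))   # 1 2: lengths are counted in code points
```

So the "character" here is still one code point; it just has the same type as the string it came from.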
Mark Tarver <dr.mtar...@ukonline.co.uk> writes:
> Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis.
I guess you mean map ß to SS? I don't know if that's really a big problem, since the opposite conversion (SS to ss or ß) is so hard (impossible to implement without a full knowledge of the ss vs ß grammar rules).... -- (espen)
Espen Vestre <es...@vestre.net> writes:
>> Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis.
> I guess you mean map ß to SS? I don't know if that's really a big problem, since the opposite conversion (SS to ss or ß) is so hard (impossible to implement without a full knowledge of the ss vs ß grammar rules)....
According to my unicode chart popup, ß (U+00DF) upcases to ẞ (U+1E9E).
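As a point of comparison, and not a statement about any CL implementation, Python's `str.upper` and `str.lower` apply Unicode's default case mappings. A small sketch, assuming Python 3:

```python
sharp_s = "\u00DF"           # ß LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1E9E"   # ẞ LATIN CAPITAL LETTER SHARP S

upper = sharp_s.upper()
lower = capital_sharp_s.lower()

print(upper, len(upper))     # SS 2: the default mapping is one-to-many
print(lower == sharp_s)      # True: ẞ lowercases to ß
print("STRASSE".lower())     # strasse: the round trip does not restore ß
```

The default upcasing of ß is "SS"; ẞ only turns up if the text already contains it.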
Mark Tarver <dr.mtar...@ukonline.co.uk> writes:
> This came up in qilang. I thought the fastest way to get some light on this would be to throw it up for general discussion here in Lisp. Is there a weakness in CL here?
I'm not convinced that characters are a bad thing. Unicode code-points, perhaps (well perhaps not a bad thing, but a low level technicality), but I don't see a problem in distinguishing characters from strings.
In any case, if you prefer substrings, nothing prevents you from using SUBSEQ instead of CHAR or AREF.
> Any CL system dealing with characters as unicode code-points will break when exposed to anything requiring combining character sequences -- e.g., Tibetan, Thai, lots of Indian scripts, etc.
Indeed. IMO, correct treatment of unicode in CL requires that CL characters be defined as combining character sequences (probably normalized, like MacOSX likes to do it). Notice that the CL standard explicitly says that (not (eq "é" "é")) is possible, there's a reason, and now we know it: unicode. It also means that CL characters, when supporting unicode, must be more complex a structure than mere fixnums. Deal with it!
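To make "combining character sequences" and "normalized" concrete, here is a minimal sketch using Python's standard `unicodedata` module. It only illustrates the Unicode side and is not a proposal for how CL should represent characters.

```python
import unicodedata

composed = "\u00E9"       # é as a single code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)            # False: different code point sequences
print(len(composed), len(decomposed))    # 1 2

# Normalizing both sides to the same form makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)  # True True
```

Both spellings render the same way, which is why a "character" defined as one code point does not match what a reader sees.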
> With javascript, python, etc, in most cases you should at least be able to pass through sequences of characters where the programmer anticipated a single character without problems.
> Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis.
STRING-UPCASE wouldn't have a problem with "ss" -> "ß", but NSTRING-UPCASE would.
However, STRING-UPCASE is not defined to perform a localized text upcasing, but in terms of CHAR-UPCASE, which is defined in terms of LOWER-CASE-P and UPPER-CASE-P (more precisely both CHAR-UPCASE and LOWER-CASE-P are defined in terms of the same glossary entry).
In any case, I repeat, the standard CL functions are not defined in terms of localized text.
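The difference between a per-character upcase and a whole-text upcase can be sketched in Python. The helper `upcase_per_character` is hypothetical: it only imitates the constraint that a character-valued function must return exactly one character, and it is not a model of any particular CL implementation.

```python
def upcase_per_character(text):
    """Upcase one code point at a time, keeping only one-to-one mappings."""
    out = []
    for ch in text:
        up = ch.upper()
        # A character-valued function cannot return "SS", so leave ch as it is.
        out.append(up if len(up) == 1 else ch)
    return "".join(out)

word = "stra\u00DFe"                     # straße
per_char = upcase_per_character(word)    # ß survives unchanged
whole = word.upper()                     # full mapping: ß becomes SS

print(per_char, len(per_char))   # STRAßE 6
print(whole, len(whole))         # STRASSE 7
```

The per-character result always has the same length as its input, so it could be written back in place. The whole-text result may be longer, which is the difficulty for a destructive operation.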
> You can think of them as having strings composed of substrings, with a fixed composition mechanism.
I think that this kind of thing is basically a disaster for a language which wants to have a coherent type system. You really, I think, have two options:
strings are a special magic type orthogonal to any other type (in particular they are not arrays or sequences);
strings are some kind of sequence/array type.
For a language which is not entirely about string bashing the latter option is the obvious one. Then you have to ask the question: what type are strings sequences of? The answer can't be "strings" because then the type system falls to bits in some horrible way: you need another type (or the option of another type: strings could be sequences whose elements are (strings OR <something>).) That other type, in CL, is characters.
The other option, having strings *not* be sequences but some special magic type is possible, and I suspect it's basically what Perl does, for instance (in so far as Perl has a coherent type system at all). I would not want CL to do this.
I don't know how this maps on to unicode: I do know that unicode has lots of cases where complicated things happen, and I also know that I don't understand it. Unfortunately the only person I knew who I was sure really *did* understand it is dead, so we can't get his opinion. I don't think that the sharp-s / ss thing in German has much to do with this, because there are lots of complicated rules around that (which I can no longer remember). It may be that assuming things like string-upcase / char-upcase &c are simple is just a huge mistake.
Joost Kremers <joostkrem...@yahoo.com> writes:
> but currently, german spelling rules state that ß upcases to SS; officially, there is no uppercase ß in german.
Hmm, ok. I thought I had seen ß in upcased words but I haven't been exposed to that much German. There's a Wikipedia article:
On Fri, 18 Mar 2011 09:42:18 +0100, Espen Vestre wrote:
> Mark Tarver <dr.mtar...@ukonline.co.uk> writes:
>> Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis.
> I guess you mean map ß to SS? I don't know if that's really a big problem, since the opposite conversion (SS to ss or ß) is so hard (impossible to implement without a full knowledge of the ss vs ß grammar rules)....
I agree. I don't think it is reasonable to expect the standard library (eg of CL) to deal with the idiosyncratic rules of natural languages, especially considering that there are many languages and their rules sometimes change (German changed recently).
CL implementations can surely _represent_ text, especially with Unicode, and the library functions can handle simple operations for languages that follow the patterns of English. I guess it is up to libraries to handle more complex stuff.
> This came up in qilang. I thought the fastest way to get some light on this would be to throw it up for general discussion here in Lisp. Is there a weakness in CL here?
...
> Opinions?
General consensus is that utf-32's only advantage is that it is slightly easier to decode than other variants. It is not useful for string indexing, merging, substitution, etc. Unless you only care about ASCII.
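A sketch of why fixed-width code points do not give indexing by user-perceived character. The `clusters` helper is hypothetical and deliberately simplified: it attaches combining marks (Unicode general categories starting with "M") to the preceding code point, which is much less than the full segmentation rules in UAX #29.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.category(ch).startswith("M"):
            groups[-1] += ch      # attach the mark to the previous base
        else:
            groups.append(ch)     # start a new cluster
    return groups

text = "e\u0301a"        # "éa" written with a combining accent
thai = "\u0E01\u0E34"    # Thai consonant KO KAI plus vowel sign SARA I

print(len(text), text[0] == "e")   # 3 True: index 0 splits the accented letter
print(len(clusters(text)))         # 2
print(len(clusters(thai)))         # 1
```

Even with one array element per code point, position N in the array is not the Nth thing a reader would call a character.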
Paul Rubin wrote:
> Joost Kremers <joostkrem...@yahoo.com> writes:
>> but currently, german spelling rules state that ß upcases to SS; officially, there is no uppercase ß in german.
> Hmm, ok. I thought I had seen ß in upcased words
Yes, it does appear sometimes, it's just not officially sanctioned by the Rat für Deutsche Rechtschreibung (Council for German Orthography). (Though the German Wikipedia article on ß <http://de.wikipedia.org/wiki/%C3%9F> suggests this may change in the future.)
-- Joost Kremers joostkrem...@yahoo.com Selbst in die Unterwelt dringt durch Spalten Licht EN:SiS(9)
> On 03/18/2011 04:19 AM, Mark Tarver wrote:
>> This came up in qilang. I thought the fastest way to get some light on this would be to throw it up for general discussion here in Lisp. Is there a weakness in CL here?
> ...
>> Opinions?
I think the thing that distinguishes string datatypes from vectors of numbers is the explicit expectation that strings be human readable. This extends to the expectation that the language can manipulate strings in a "natural" way (e.g. things like word splitting, capitalization, and text rendering).
As Tim Bray notes, much code just needs a fast solution that treats strings as an opaque bag-o-bits to be copied and handed to other code for rendering. So the correct solution isn't always the best one.
January was an interesting month for string discussion on the Boost devel list. Here are the main threads I remember. While there are the usual flames and such, there are some good posts in here.
One last thing I wanted to add: as specified, unicode (especially utf8) is somewhat like a Huffman code optimized for ASCII. Unicode needs a universal external format; but a given text rarely contains a random mix of characters. Thus I suspect that many programs would benefit from using a locale-specific internal format.
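The "optimized for ASCII" remark can be checked by encoding one character from each of a few scripts. A minimal Python sketch; the sample characters are my own choice.

```python
samples = {
    "ASCII": "a",
    "Latin-1 range": "\u00DF",   # ß
    "Cyrillic": "\u0436",        # ж
    "CJK": "\u661F",             # 星
}

utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf32_sizes = {name: len(ch.encode("utf-32-le")) for name, ch in samples.items()}

for name in samples:
    print(name, utf8_sizes[name], utf32_sizes[name])
# ASCII 1 4
# Latin-1 range 2 4
# Cyrillic 2 4
# CJK 3 4
```

UTF-8 costs one byte per ASCII character and two or three for the others shown, so a text that is mostly Cyrillic or CJK pays more per character than an ASCII one.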
I think that's all I'll say unless people have specific questions.
On the nineteenth of March, D. Herring wrote:
> One last thing I wanted to add: as specified, unicode (especially utf8) is somewhat like a Huffman code optimized for ASCII. Unicode needs a universal external format; but a given text rarely contains a random mix of characters. Thus I suspect that many programs would benefit from using a locale-specific internal format.
We’re stuck with UTF-8. The closest thing to a locale-specific internal format would be something like windows-1251 or Shift-JIS, and using them leads to data loss because not everyone using Windows-1251 will stick to that subset of Cyrillic plus ASCII, and not everyone using Shift-JIS will stick to ASCII and the relevant Kanji and kana.
> I think that's all I'll say unless people have specific questions. > > - Daniel
-- “Apart from the nine-banded armadillo, man is the only natural host of Mycobacterium leprae, although it can be grown in the footpads of mice.” -- Kumar & Clark, Clinical Medicine, summarising improbable leprosy research
> On the nineteenth of March, D. Herring wrote:
> > One last thing I wanted to add: as specified, unicode (especially utf8) is somewhat like a Huffman code optimized for ASCII. Unicode needs a universal external format; but a given text rarely contains a random mix of characters. Thus I suspect that many programs would benefit from using a locale-specific internal format.
> We’re stuck with UTF-8. The closest thing to a locale-specific internal format would be something like windows-1251 or Shift-JIS, and using them leads to data loss because not everyone using Windows-1251 will stick to that subset of Cyrillic plus ASCII, and not everyone using Shift-JIS will stick to ASCII and the relevant Kanji and kana.
No need to use windows-1251 or anything along those lines. One can map arbitrarily to an array of 8-bit elements, or 16-bit for a language that needs a lot of characters. You get the performance benefit of fixed-length characters if you're doing a lot of work that involves finding a character in a specific location, and then you convert back to UTF-8 when you save to a file, transmit over the network, or whatever you're going to do. No data loss at all.
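One possible reading of this suggestion, sketched in Python. The post does not say how the mapping is chosen, so building the table from the text itself is my assumption, and `pack` and `unpack` are hypothetical helpers.

```python
from array import array

def pack(text):
    """Map each distinct code point to a small index; return (codebook, indices)."""
    codebook = sorted(set(text))
    index_of = {ch: i for i, ch in enumerate(codebook)}
    # 8-bit elements when they suffice, otherwise 16-bit ("H" is at least 2 bytes).
    # Assumes fewer than 65536 distinct code points in the text.
    typecode = "B" if len(codebook) <= 256 else "H"
    return codebook, array(typecode, (index_of[ch] for ch in text))

def unpack(codebook, indices):
    """Rebuild the original string from the fixed-width index array."""
    return "".join(codebook[i] for i in indices)

text = "Привет, мир! \u661F"
codebook, indices = pack(text)
restored = unpack(codebook, indices)
utf8 = restored.encode("utf-8")    # convert to UTF-8 only at the boundary

print(indices.itemsize, restored == text)   # 1 True
```

Positional access is constant time and the round trip is lossless, but the elements are still code points, so a combining sequence still spans several of them.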
On Mar 18, 5:19 am, Mark Tarver <dr.mtar...@ukonline.co.uk> wrote:
> Any CL system dealing with characters as unicode code-points will break when exposed to anything requiring combining character sequences -- e.g., Tibetan, Thai, lots of Indian scripts, etc.
> With javascript, python, etc, in most cases you should at least be able to pass through sequences of characters where the programmer anticipated a single character without problems.
> Even doing case conversion in German breaks in CL, since string-upcase can't map "ss" to ß, since it operates on a character by character basis.
Only if you use some function that specifically operates on a character-by-character basis. I think nstring-upcase or char-upcase would fall into this category.
But for others, such as string-upcase, it is just a matter of having a locale sensitive operation that will correctly replace your ss with ß.
No "breaking", it is just a bug introduced by the developer.
Anticomuna <ts.concei...@uol.com.br> writes:
> But for others, such as string-upcase, it is just a matter of having a locale sensitive operation that will correctly replace your ss with ß.
Yes, something that is not string-upcase, since string-upcase is specified to work character by character.
>> Thus I suspect that many programs would benefit from using a >> locale-specific internal format.
> We’re stuck with UTF-8.
I'm very confused by this. Do you intend "internal format" to mean "the format of strings in memory" or "a private file format for an application", or something else? Do there exist CL implementations which use UTF-8 for strings?
On Mar 19, 12:35 pm, "Pascal J. Bourguignon" <p...@informatimago.com> wrote:
> Anticomuna <ts.concei...@uol.com.br> writes:
> > But for others, such as string-upcase, it is just a matter of having a locale sensitive operation that will correctly replace your ss with ß.
> Yes, something that is not string-upcase, since string-upcase is specified to work character by character.
Well, CL was created at a time when internationalization wasn't much of a concern and that's why it just ignores it. It is possible for an implementor to change the string-upcase function to handle locales appropriately.
The nstring-upcase wouldn't allow it because it just alters an existing string. Char-upcase wouldn't allow it because it returns just one character at a time.
But my point was that what the OP said has nothing to do with Unicode. All that is needed is a proper library.
On the nineteenth of March, Tim Bradshaw wrote:
> On 2011-03-19 14:37:20 +0000, Aidan Kehoe said:
>
> > On the nineteenth of March, D. Herring wrote:
> >
> >> One last thing I wanted to add: as specified, unicode (especially utf8) is somewhat like a Huffman code optimized for ASCII. Unicode needs a universal external format; but a given text rarely contains a random mix of characters. Thus I suspect that many programs would benefit from using a locale-specific internal format.
> >
> > We’re stuck with UTF-8.
>
> I'm very confused by this. Do you intend "internal format" to mean "the format of strings in memory" or "a private file format for an application", or something else?
I didn’t write “internal format”, but I understood it as *mostly* in the first sense.
I mean, we *could* develop Unicode transformation formats that are optimised for particular scripts, and the PRC’s GB18030 is such a transformation format, in a sense. But no-one’s going to; there’s not sufficient benefit from a locale-specific internal format that supports all of Unicode.
> Do there exist CL implementations which use UTF-8 for strings?
I’ve restored some context you snipped above. I don’t know if any CL implementations use UTF-8 for strings, but some certainly use Unicode, which is where D. Herring is coming from there.
> I didn’t write “internal format”, but I understood it as *mostly* in the first sense.
I apologise, I think I've misparsed the messages.
What I read was D. Herring saying "... many programs could benefit from using a locale-specific internal format." and you replying "We’re stuck with UTF-8.", to which I assumed an implicit "... as an internal format" was added, obviously incorrectly.
Obviously many (perhaps all) CLs use Unicode now (and presumably all of those support UTF-8 as an external format), but I can't imagine how they would use UTF-8 for strings.
Anticomuna <ts.concei...@uol.com.br> writes:
> On Mar 19, 12:35 pm, "Pascal J. Bourguignon" <p...@informatimago.com> wrote:
>> Anticomuna <ts.concei...@uol.com.br> writes:
>> > But for others, such as string-upcase, it is just a matter of having a locale sensitive operation that will correctly replace your ss with ß.
>> Yes, something that is not string-upcase, since string-upcase is specified to work character by character.
> Well, CL was created in a time where internationalization wasn't much of a concern and that's why it just ignores it. It is possible for an implementor to change the string-upcase function to handle locales appropriately.
Ok, in the case of string-upcase, since it accepts keyword arguments, it could be extended with a :localized-for parameter.
> The nstring-upcase wouldn't allow it because it just alters an existing string. Char-upcase wouldn't allow it because it returns just one character at a time.
This is not the reason. It is possible because string-upcase is defined to take keyword arguments. Therefore it's possible for the implementation to define new keyword arguments, and for the program to call it with :allow-other-keys t :localized-for :tibetan arguments.
> But my point was that what the OP said has nothing to do with Unicode. > All is needed is a proper library.
Agreed, there's no need to change CL:STRING-UPCASE when we can have a LOCALIZATION:TEXT-UPCASE function at the same time.
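The shape of such a separate, locale-aware function can be sketched in Python. `text_upcase` and its `locale` parameter are hypothetical names for illustration only. The only special case handled is Turkish dotted and dotless i, which is a well-known example of locale-dependent casing and not one raised in this thread.

```python
def text_upcase(text, locale=None):
    """Hypothetical locale-aware upcase; only "tr" gets special handling here."""
    if locale == "tr":
        # Turkish: dotted i upcases to İ (U+0130). Dotless ı (U+0131) already
        # upcases to plain I under the default mapping, so it needs no rule.
        text = text.replace("i", "\u0130")
    # Everything else uses the default Unicode mapping (including ß to SS).
    return text.upper()

print(text_upcase("istanbul"))          # ISTANBUL
print(text_upcase("istanbul", "tr"))    # İSTANBUL
print(text_upcase("stra\u00DFe", "de")) # STRASSE
```

The standard per-character operation stays as it is, and the locale rules live in a function that is free to return a string of a different length.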