* Sander Vesik | Wake up, smell the coffee and learn about 'combiners'. And then *think* | just a little bit, including about things like collation, sort order and | similar.
Perhaps you are unaware of the character concept as used in Unicode? It would seem prudent at this time for you to return to the sources and obtain the information you lack. To wit, what you incompetently refer to as "combiners" are actually called "combining characters". I suspect you knew that, too, since nobody _else_ calls them "combiners". But it seems that you are fighting for your honor, now, not technical correctness, and I shall leave to you another pathetic attempt to feel good about yourself when you should acknowledge inferior knowledge and learn something.
Oh, by the way, Unicode has three levels. Study Unicode, and you will know what they mean and what they do. Hint: "variable-length character" is an incompetent restatement. A single _glyph_ may be made up of more than one _character_ and a given glyph may be specified using more than one character. If you had known Unicode at all, you would know this.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
In comp.lang.scheme Erik Naggum <e...@naggum.net> wrote:
> * Sander Vesik > | Wake up, smell the coffee and learn about 'combiners'. And then *think* > | just a little bit, including about things like collation, sort order and > | similar.
> Perhaps you are unaware of the character concept as used in Unicode? It > would seem prudent at this time for you to return to the sources and > obtain the information you lack. To wit, what you incompetently refer to > as "combiners" are actually called "combining characters". I suspect you > knew that, too, since nobody _else_ calls them "combiners". But it seems > that you are fighting for your honor, now, not technical correctness, and > I shall leave to you another pathetic attempt to feel good about yourself > when you should acknowledge inferior knowledge and learn something.
I don't subscribe to the concept of honour. I also couldn't care less what you think of me.
> Oh, by the way, Unicode has three levels. Study Unicode, and you will > know what they mean and what they do. Hint: "variable-length character" > is an incompetent restatement. A single _glyph_ may be made up of more > than one _character_ and a given glyph may be specified using more than > one character. If you had known Unicode at all, you would know this.
It is pointless to think of glyphs in any other way than as characters - it should not make any difference whether a-diaeresis is represented by one code point - the precombined one - or two. In fact, if there is a detectable difference from anything dealing with text strings, the implementation is demonstrably broken.
* Sander Vesik | I also couldn't care less what you think of me.
You should realize that only people who care a lot, make this point.
| It is pointless to think of glyphs in any other way than as characters - it | should not make any difference whether a-diaeresis is represented by one | code point - the precombined one - or two. In fact, if there is a | detectable difference from anything dealing with text strings, the | implementation is demonstrably broken.
It took the character set community many years to figure out the crucial conceptual and then practical difference between the "characteristic glyph" of a character and the character itself, namely that a character may have more than one glyph, and a glyph may represent more than one character. If you work with characters as if they were glyphs, you _will_ lose, and you make just the kind of arguments that were made by people who did _not_ grasp this difference in the ISO committees back in 1992 and who directly or indirectly caused Unicode to win over the original ISO 10646 design. Unicode has many concessions to those who think character sets are also glyph sets, such as the presentation forms, but that only means that there are different times you would use different parts of the Unicode code space. Some people who try to use Unicode completely miss this point.
It also took some _companies_ a really long time to figure out the difference between glyph sets and character sets. (E.g., Apple and Xerox, and, of course, Microsoft has yet to reinvent the distinction badly in the name of "innovation", so their ISO 8859-1-like joke violates important rules for character sets.) I see that you are still in the pre-enlightenment state of mind and have failed to grasp what Unicode does with its three levels. I cannot help you, since you appear to stop thinking in order to protect or defend yourself or whatever (it sure looks like some mideast "honor" codex to me), but if you just pick up the standard and read its excellent introductions or even Unicode: A Primer, by Tony Graham, you will understand a lot more. It does an excellent job of explaining the distinction between glyph and character. I think you need it much more than trying to defend yourself by insulting me with your ignorance.
Now, if you want to use or not use combining characters, you make an effort to convert your input to your preferred form before you start processing. This isolates the "problem" to a well-defined interface, and it is no longer a problem in properly designed systems. If you plan to compare a string with combining characters with one without them, you are already so confused that there is no point in trying to tell you how useless this is. This means that thinking in terms of "variable-length characters" is prima facie evidence of a serious lack of insight _and_ an attitude problem that something somebody else has done is wrong and that you know better than everybody else. Neither are problems with Unicode.
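[Editorial illustration, not part of the original post.] The advice above, converting input to one preferred form at a well-defined interface, can be sketched in Python with the standard `unicodedata` module. The function name `normalize_input` and the choice of NFC as the default are assumptions made for this sketch; the post does not prescribe a particular form.

```python
import unicodedata

def normalize_input(raw: str, form: str = "NFC") -> str:
    """Convert incoming text to one preferred normalization form.

    Calling this once at the input boundary means the rest of the
    program only ever sees a single representation of each character.
    """
    return unicodedata.normalize(form, raw)

precomposed = "\u00e4"   # a-diaeresis as one code point
decomposed = "a\u0308"   # 'a' followed by COMBINING DIAERESIS

# The raw strings differ, but they agree after normalization.
same_raw = precomposed == decomposed
same_normalized = normalize_input(precomposed) == normalize_input(decomposed)
```

With NFC as the internal form both inputs become a single code point; passing `form="NFD"` would instead give two code points for both.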
So a secondary question; if one is designing a new Common Lisp or Scheme system, and one is not encumbered by any requirements about being consistent with existing code, existing operating systems, or existing communications protocols and interchange formats: that is, if one gets to design the world over again:
Should the Scheme/CL type "character" hold Unicode characters, or Unicode glyphs? (It seems clear to me that it should hold characters, but I might be thinking about it poorly.)
And, whichever answer, why is that the right answer?
> Should the Scheme/CL type "character" hold Unicode characters, or > Unicode glyphs? (It seems clear to me that it should hold characters, > but I might be thinking about it poorly.)
> And, whichever answer, why is that the right answer?
one could use "the cheap man's unicode" or utf-8. actually personally I don't care so much about unicode and have held it in the "possibly later" respect. for now it is not terribly important as I can just restrict myself to the lower 128 characters. in any case it sounds simpler to implement than the "codepage" system, so I will probably use it.
"ich bin einen Amerikaner, und ich tun nicht erweiterter Zeichen noetig" (don't mind bad grammar, as I don't really know german...).
* tb+use...@becket.net (Thomas Bushnell, BSG) | Should the Scheme/CL type "character" hold Unicode characters, or | Unicode glyphs? (It seems clear to me that it should hold characters, | but I might be thinking about it poorly.)
There are no Unicode glyphs. This properly refers to the equivalence of a sequence of characters starting with a base character and optionally followed by combining characters, and "precomposed" characters. This is the canonical-equivalence of character sequences. A processor of Unicode text is allowed to replace any character sequence with any of its canonically-equivalent character sequences. It is in this regard that an application may want to request a particular composite character either as one character or a character sequence, and may decide to examine each coded character element individually or as an interpreted character. These constitute three different levels of interpretation that it must be possible to specify. Since an application is explicitly permitted to choose any of the canonical-equivalent character sequences for a character, the only reasonable approach is to normalize characters into a known internal form.
There is one crucial restriction on the ability to use equivalent character sequences. ISO 10646 defines implementation levels 1, 2 and 3 that, respectively, prohibit all combining characters, allow most combining characters, and allow all combining characters. This is a very important part of the whole Unicode effort, but Unicode has elected to refer to ISO 10646 for this, instead of adopting it. From my personal communication with high-ranking officials in the Unicode consortium, this is a political decision, not a technical one, because it was feared that implementors that would be happy with trivial character-to-glyph--mapping software (such as a conflation of character and glyph concepts and fonts that support this conflation), especially in the Latin script cultures, would simply drop support for the more complex usage of the Latin script and would fail to implement e.g., Greek properly. Far from being an enabling technology, it was feared that implementing the full set of equivalences would be omitted and thus not enable the international support that was so sought after. ISO 10646, on the other hand, has realized that implementors will need time to get all this right, and may choose to defer implementation of Unicode entirely if they are not able to do it stepwise. ISO 10646 Level 1 is intended to be workable for a large number of uses, while Level 3 is felt not to have an advantage qua requirement until languages that require far more than composition and decomposition are to be fully supported. I concur strongly with this.
The character-to-glyph mapping is fraught with problems. One possible way to do this is actually to use the large private use areas to build glyphs and then internally use only non-combining characters. The level of dynamism in the character coding and character-to-glyph mapping here is so much more difficult to get right that the canonical-equivalent sequences of characters (which is a fairly simple table-lookup process) pales in comparison. That is, _if_ you allow combining characters, actually being able to display them and reason about them (such as computing widths or dealing with character properties of the implicit base character or converting their case) is far more difficult than decomposing and composing characters.
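[Editorial illustration, not part of the original post.] One way to test canonical equivalence is to compare the two sequences after bringing both into the same normalization form. The sketch below uses Python's standard `unicodedata` module; the helper name `canonically_equivalent` and the choice of NFD as the comparison form are assumptions for illustration only.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b are canonically-equivalent character sequences.

    Both strings are fully decomposed (NFD), which also puts combining
    characters into canonical order, and then compared code point by
    code point.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed character versus base character plus combining character.
pair_one = canonically_equivalent("\u00e4", "a\u0308")

# Two combining marks given in different orders (dot above, dot below).
pair_two = canonically_equivalent("q\u0307\u0323", "q\u0323\u0307")

# Case is not a canonical equivalence.
pair_three = canonically_equivalent("A", "a")
```

Comparing through a normalization form is the table-lookup process the post refers to; nothing here touches display or glyph selection.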
As for the scary effect of "variable length" -- if you do not like it, canonicalize the input stream. This really is an isolatable non-problem.
* Thomas Bushnell, BSG | So a secondary question; if one is designing a new Common Lisp or Scheme | system, and one is not encumbered by any requirements about being | consistent with existing code, existing operating systems, or existing | communications protocols and interchange formats: that is, if one gets to | design the world over again:
If we could design the world over again, the _first_ thing I would want to do is making "capital letter" a combining modifier instead of doubling the size of the code space required to handle it. Not only would this be such a strong signal to people not to use case-sensitive identifiers in programming languages, we would have a far better time as programmers. E.g., considering the enormous amount of information Braille can squeeze into only 6 bits, with codes for many common words and codes to switch to and from digits and to capital letters, the limitations of their code space have effectively been very beneficial.
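[Editorial illustration, not part of the original post.] As a thought experiment only, the idea of "capital letter" as a combining modifier can be sketched in Python. Everything here is hypothetical: no such modifier exists in Unicode, the private-use code point U+E000 merely stands in for it, and placing it after the base letter is an assumption borrowed from how Unicode combining characters work (Braille places its capital indicator before the letter).

```python
CAP = "\ue000"  # hypothetical "capital" modifier; a private-use code point

def encode(text: str) -> str:
    """Store every letter in lower case, marking capitals with CAP."""
    out = []
    for ch in text:
        # Only handle letters whose lower-case form is a single character.
        if ch.isupper() and len(ch.lower()) == 1:
            out.append(ch.lower() + CAP)
        else:
            out.append(ch)
    return "".join(out)

def decode(coded: str) -> str:
    """Rebuild conventional text by applying CAP to the preceding letter."""
    out = []
    for ch in coded:
        if ch == CAP and out:
            out[-1] = out[-1].upper()
        else:
            out.append(ch)
    return "".join(out)

def strip_case(coded: str) -> str:
    """Ignore case entirely: drop the modifier, no folding table needed."""
    return coded.replace(CAP, "")
```

Under this scheme comparing identifiers without regard to case is a matter of discarding one code point, which is the kind of saving the post has in mind.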
> If we could design the world over again, the _first_ thing I would > want to do is making "capital letter" a combining modifier instead > of doubling the size of the code space required to handle it. Not > only would this be such a strong signal to people not to use > case-sensitive identifiers in programming languages, we would have > a far better time as programmers.
Could you elaborate on that a bit? I'm interested because it appears that your position is that case-sensitivity in identifiers is a Bad Thing for programming languages.
A general principle of mine is that if things are distinguishable, they should not be collapsed but the distinction should be preserved whenever possible. Treating different characters as the same character, or treating different character sequences as equivalent, should be postponed as long as possible in order to preserve information.
Are you suggesting that this principle is inappropriate to apply to the character sequences that compose identifiers in source code? That would mean that "ABLE" is the same identifier as "able". I must admit that when I first found out that current lisps have case-insensitive symbol names, I thought it reminiscent of BASIC -- kind of a throwback to a time when memory was much more at a premium. (I know that Lisp predates BASIC. I'm talking about my reaction.) I'd be happy to hear a good case for case-insensitive identifiers.
> ... > > If we could design the world over again, the _first_ thing I would > > want to do is making "capital letter" a combining modifier instead > > of doubling the size of the code space required to handle it. Not > > only would this be such a strong signal to people not to use > > case-sensitive identifiers in programming languages, we would have > > a far better time as programmers.
> Could you elaborate on that a bit? I'm interested because it appears > that your position is that case-sensitivity in identifiers is a Bad > Thing for programming languages.
> A general principle of mine is that if things are distinguishable, > they should not be collapsed but the distinction should be preserved > whenever possible. Treating different characters as the same > character, or treating different character sequences as equivalent, > should be postponed as long as possible in order to preserve > information.
Psychology experiments have empirically shown that memory is auditory. That is, when you misremember words, you misremember them by soundalike, not by lookalike. There is also ample linguistic evidence that the core of human language is an auditory phenomenon. When languages vary, they first change in their spoken form and then later writing catches up, not much vice versa. Since the spoken form has no notation for case differentiation, the pretty obvious conclusion is that conceptual information is not best carried in case. People don't remember whether they saw a word written in uppercase or lowercase, they just remember the word. It is very rare and quite awkward for someone to say "Use Capitalized-Foo" or "Use All-Uppercase-FOO" to someone out loud in areas other than computer science where people have worked themselves into corners by being pedantic on a "general principle" as in your previous paragraph rather than observing well-researched truths about how people really think.
Some of us believe that a proper harmonization/synchronization with the way people's brains work is more important than catering to a theoretical model that some people think would be a nice way for people to think.
I personally have made it a design goal in languages that I've worked on to think hard about making even programming languages gracefully pronounceable so that people can talk about programs aloud to each other over dinner, etc. Modern Lisp has mostly moved away from obscure little names like "rplacd" and such (a small number being retained mostly for history). For new concepts, make names like MOST-POSITIVE-FIXNUM not MAXINT.
Even in cased languages, mostly people don't use case to distinguish, they just use it for controlling the look of code. It's not uncommon for people to have some things named Foo and others named BAR, but it's rarer for things to be both named foo and Foo in a context where simple namespacing can't tell the difference. So often again you don't hear people saying the case out loud because it can be determined from other factors. At that point, you might as well let people write stuff in whatever case they want, for ease of input, and just let code pretty-printers adjust the case to a pretty look if it's really needed.
IMO, no ordinary code should ever be case-sensitive and it's a darned shame that XML uses case-sensitive identifiers. I think it does mainly so it can service languages that have made a bad design decision ... so it's a dependent bad decision, not an independent one.
> Are you suggesting that this principle is inappropriate to apply to > the character sequences that compose identifiers in source code? That > would mean that "ABLE" is the same identifier as "able".
Yes.
> I must admit > that when I first found out that current lisps have case-insensitive > symbol names, I thought it reminiscent of BASIC -- kind of a throwback > to a time when memory was much more at a premium. (I know that Lisp > predates BASIC. I'm talking about my reaction.) I'd be happy to hear > a good case for case-insensitive identifiers.
Cased names are often a substitute in infix languages for having given up hyphen in a way that got messy. You can't call a variable MOST-POSITIVE-FIXNUM in most languages, because the language thinks you mean MOST - POSITIVE - FIXNUM, a subtraction. Dylan requires you to put spaces around minus so it can have both hyphenated names and subtraction. Doing MostPositiveFixnum is not very natural and also forces case to be used in a way that supports separation, taking away the ability to use case for what it was intended for: supporting the underlying language. So if I have a word like eBusiness in "English" and I want to compose it into a function, do I make it be MakeeBusinessName or MakeEbusinessName or .... personally, I prefer make-eBusiness-name.
It might even be better to use _'s, but it's a shifted character on most keyboards, and people with weak fingers hate shifting that often, so hyphens tend to be preferred. make_eBusiness_name might otherwise be better, and would save confusion with minus sign.
[CL uses uppercase as the canonical case for the case-normalized name, and that's controversial with some people, but some of us like it. In any case, it's orthogonal to this other question about case translation.]
In any case, my real point is not to say there's a 100% clear answer here, but merely to motivate that the choice of case-translation is not archaic but definitely has support from people who think themselves to be living in the present.
* Ed L Cashin <ecas...@uga.edu> | Could you elaborate on that a bit? I'm interested because it appears | that your position is that case-sensitivity in identifiers is a Bad | Thing for programming languages.
I consider it a bad thing to believe that A is a different character from a just because it has a certain "presentation property". I mean, we do not distinguish characters based on font or face, underlining or color, and most people realize that these are incidental properties. However, capitalness of a letter is just as incidental: The fact that a letter is capitalized depending on such randomness as the position of the word in the sentence is a very strong indicator that "However" and "however" are not different words, which is effectively what case-sensitive people think they are. I tried to publish text without this incidental property for a while, but it seemed to tick people off even more than calling an idiot an idiot.
| A general principle of mine is that if things are distinguishable, they | should not be collapsed but the distinction should be preserved whenever | possible. Treating different characters as the same character, or | treating different character sequences as equivalent, should be postponed | as long as possible in order to preserve information.
If you use colors to distinguish keywords from identifiers in your editor, can you use a keyword with a different color as an identifier?
| Are you suggesting that this principle is inappropriate to apply to the | character sequences that compose identifiers in source code? That would | mean that "ABLE" is the same identifier as "able".
| I must admit that when I first found out that current lisps have | case-insensitive symbol names, I thought it reminiscent of BASIC -- kind | of a throwback to a time when memory was much more at a premium.
But this is not the case. The symbol names are case-sensitive, but the Common Lisp reader maps all unescaped characters to uppercase by default. You can change this. Symbols are in this fashion just like normal words in your natural language.
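[Editorial illustration, not part of the original post.] The behaviour described above can be modelled with a toy Python function. This is a simplified model of the default reader mode only, not the Common Lisp reader itself: the function name `symbol_name_from_token` is invented here, it handles only case folding plus the two standard escape syntaxes (vertical bars around a run of characters, backslash before a single character), and it ignores packages, numbers, and other readtable-case settings.

```python
def symbol_name_from_token(token: str) -> str:
    """Toy model: unescaped characters fold to upper case, escaped ones do not."""
    out = []
    in_bars = False
    i = 0
    while i < len(token):
        ch = token[i]
        if ch == "\\":
            # Single escape: the next character is kept exactly as written.
            i += 1
            if i >= len(token):
                raise ValueError("backslash at end of token")
            out.append(token[i])
        elif ch == "|":
            # Multiple escape: toggles a region that is kept as written.
            in_bars = not in_bars
        elif in_bars:
            out.append(ch)
        else:
            out.append(ch.upper())
        i += 1
    if in_bars:
        raise ValueError("unterminated | in token")
    return "".join(out)
```

In this model `car`, `Car` and `CAR` all name the symbol "CAR", while `|car|` names a different symbol "car", which is the sense in which the symbol names themselves stay case-sensitive.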
| (I know that Lisp predates BASIC. I'm talking about my reaction.) I'd | be happy to hear a good case for case-insensitive identifiers.
I think case sensitivity is an abuse of an incidental property. Thus, I want to hear a good case for case-sensitive identifiers. Older languages did not have this property, but after Unix (which has a case-insensitive tty mode!), the norm became to distinguish case, largely because there was no other namespace functionality in early C. Unix also chose to use lower-case commands whereas Multics had always supported case-folding. I believe the reason that the Unix people wanted to distinguish case was that folding case would require an extra instruction and a lookup table that would waste a precious 128 bytes of memory in the kernel, while we currently waste an enormous amount of memory to keep case-folding tables several times over. In my view, case-sensitive identifiers have become the norm in a community that has failed to think about proper solutions to their problems, but rather choose to solve only the immediate problem, much like C strongly encourages irrelevant micro-optimization. So instead of being nice to the user, they were nice to the programmer, who did not have to case-fold the incoming identifiers. I consider moving this burden onto the user to be quite user-inimical and actually quite foreign to people who do not know the character coding standards. I mean, do we have case-sensitive trademarks, even though we traditionally capitalize proper names? Are Oracle and ORACLE different companies any more than ORACLE in red boldface 14 point Times Roman is a different company than ORACLE in blue italic 12 point Helvetica?
There has definitely been a "paradigm shift" in computer people's view on case, but not in non-computer people. Internet protocols like SMTP use case-insensitive commands. The DNS is case-insensitive. SGML is case-insensitive and so is HTML. Because of the huge problems we face with case-folding Unicode (which must be done with a table of some kind), some people have figured that we should _not_ do case-folding. That is the wrong solution to the problem. The right solution to the problem is to get rid of case as a character property.
Now, assume that we no longer have different character codes for lower-case and upper-case letters. Would there be any difference in how we look at text on computer screens, in print, etc? No, of course not. Therefore, people would still be able to distinguish identifiers visually based on case if they want to -- just like the Common Lisp reader allows you to write |car| to refer to the symbol named "car", and |CAR| to refer to the symbol named "CAR", and just like Unix can deal with upper- and lower-case letters even when iuclc and olcuc are in effect with the xcase option by backslashing the real uppercase characters in your input. (In Common Lisp, you would backslash a lower-case character in the default reader mode, and the printer will escape those characters that should not be case-folded.) However, being able to do something and actually doing it are two very different things. E.g., on TOPS-20, you could use lower-case letters in filenames if you really wanted to, by prefixing them with ^V. Very few people bothered to do this because typing it in was a hassle. I do not propose any change to how we input upper and lower case, but with the anal-retentive approach to saving bits, which has even gone so far as to write FooBarZot instead of foo-bar-zot, the probability that the C freaks would have chosen case-sensitivity would be remarkably lower -- if we could go back and design the world over...
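[Editorial illustration, not part of the original post.] The remark that Unicode case-folding "must be done with a table of some kind" can be seen in Python, whose built-in `str.casefold()` applies such a table. The helper name `same_ignoring_case` is an assumption for this sketch, and it deliberately skips normalization, which a fuller caseless comparison would also need.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison using the full case-folding table."""
    return a.casefold() == b.casefold()

# ASCII protocol tokens and host names: folding is a fixed offset.
smtp_match = same_ignoring_case("HELO", "helo")
dns_match = same_ignoring_case("Example.COM", "example.com")

# Beyond ASCII a simple offset no longer works: sharp s folds to "ss".
folded_match = same_ignoring_case("Straße", "STRASSE")

# Plain lower-casing is not enough for the same pair.
lowered_match = "Straße".lower() == "STRASSE".lower()
```

The last two comparisons disagree, which is a small instance of the table-driven complexity the post wants to avoid by removing case as a character property.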
Centuries ago, Nostradamus foresaw when Kent M Pitman <pit...@world.std.com> would write:
> Psychology experiments have empirically shown that memory is > auditory. That is, when you misremember words, you misremember them > by soundalike, not by lookalike. There is also ample linguistic > evidence that the core of human language is an auditory phenomenon. > When languages vary, they first change in their spoken form and then > later writing catches up, not much vice versa.
I agree in part.
The "western" languages certainly are representative of that; our languages are largely a way of taking what we say and putting it on paper. (Computers being an insignificant "blip" thus far in the history of it :-).)
My understanding of the Asian languages is that they are often _not_ such a representation; what is written is _not_ an account of what is spoken. Writing is, there, representative of a separate language. In more clearly "pictographic" languages, there may _not_ be an auditory form except as constructed afterwards.
That caveat being given, words don't usually sound different when they have different casing and aren't usually recognized as being different.
"That" is not a different word from "that." -- (reverse (concatenate 'string "ac.notelrac.teneerf@" "454aa")) http://www.ntlug.org/~cbbrowne/linux.html "Of _course_ it's the murder weapon. Who would frame someone with a fake?"
> A general principle of mine is that if things are distinguishable, > they should not be collapsed but the distinction should be preserved > whenever possible. Treating different characters as the same > character, or treating different character sequences as equivalent, > should be postponed as long as possible in order to preserve > information.
This is your opinion, and many people agree with you, but many do not, as well. This is a very controversial subject. And it's not just in comp.lang.lisp that you'll find this same controversy; at about the same time as our last discussion here there was a similar one raging on comp.arch. The difference was that here the case-insensitive style being advocated was (of course) the case-folding style that the Common Lisp reader standardizes, and in comp.arch the predominant case-insensitive style being argued was the "case-preserving" style, which is the kind of recognition style that both Mac and Windows filesystems support (i.e. first reference gets internalized as originally specified, but subsequent references are matched against the filename without regard to case). This case-preserving insensitive style was being pitted against the Unix case-sensitive style. Of course, neither side changed the other's mind.
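[Editorial illustration, not part of the original post.] The "case-preserving" style described above (the first reference is stored as written, later references match without regard to case) can be sketched as a small Python name table. The class name `CasePreservingNames` and its `intern` method are invented for this sketch, and real filesystems apply their own folding rules; `str.casefold()` is used here only as a convenient stand-in.

```python
class CasePreservingNames:
    """Name table that keeps the first spelling and matches caselessly."""

    def __init__(self):
        self._by_key = {}

    def intern(self, name: str) -> str:
        """Return the stored spelling for name, storing it on first use."""
        key = name.casefold()
        # setdefault keeps the earliest spelling seen for this key.
        return self._by_key.setdefault(key, name)

    def __len__(self) -> int:
        return len(self._by_key)

names = CasePreservingNames()
first = names.intern("ReadMe.txt")    # stored as originally specified
second = names.intern("README.TXT")   # matches the existing entry
other = names.intern("notes")
```

By contrast, the case-folding style of the Common Lisp reader would discard the original spelling at read time, and the Unix case-sensitive style would treat the two spellings as two names.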
Arguing case-sensitivity is very similar to arguing endianness; there are good arguments for both big-endian and little-endian, and neither side is fully right or fully wrong, though a decision must usually be made, because it is generally hard to mix the two together in the same machine.
> Are you suggesting that this principle is inappropriate to apply to > the character sequences that compose identifiers in source code? That > would mean that "ABLE" is the same identifier as "able". I must admit > that when I first found out that current lisps have case-insensitive > symbol names, I thought it reminiscent of BASIC -- kind of a throwback > to a time when memory was much more at a premium. (I know that Lisp > predates BASIC. I'm talking about my reaction.) I'd be happy to hear > a good case for case-insensitive identifiers.
First, I'll note (as others have) that Common Lisp does have case-sensitive identifiers, and always has. It is the reader that is specified to fold to uppercase by default. And even the standard CL reader is highly configurable, to allow cases to be specified by readtable options.
Second, the choice of case-sensitivity or not is not bounded by time. Going back to the endianness question, some engineers 10 years ago said "the little-endian side has lost". However, I suspect that if you count all of the little-endian machines in existence today, you find it hard to justify that claim. In fact, even many computers which are generally considered to be big-endian are now architected to allow for either endianness.
Finally, I personally believe in choice. Our own product has always allowed one to choose whether to decide on the Common Lisp specified case-insensitive reader, or whether to configure the reader to be case-sensitive by default. Our customer base has always taken advantage of that choice, with anywhere from approximately 20% to 35% choosing the case-sensitive mode, and the majority choosing the Common Lisp (case-insensitive, folding to uppercase) mode. And of course, this does not account for people who use lisps of both modes for different purposes. Nowadays, there is a slight increase in case-sensitive mode for the purpose of interfacing relatively directly with some currently popular case-sensitive languages. The point, though, is that we have always provided a choice, and always intend to provide a choice.
In fact, Kent Pitman recently sent us a proposal for unifying the two major case-modes that Allegro CL provides, in such a way that the two can exist in the same lisp simultaneously. We have an rfe (request for enhancement document) which starts with his proposal as a basis. I would love to see us succeed in making this or any similar unification, and I was excited to see Kent's proposal when he sent it to us.
It's all about choice. Calling the case-insensitive choice a "throwback" is the same as calling it invalid (or no longer valid). And based on my own experience here and in comp.arch, that is simply incorrect. People still choose both styles, and probably always will.
-- Duane Rettig Franz Inc. http://www.franz.com/ (www) 1995 University Ave Suite 275 Berkeley, CA 94704 Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)
Erik Naggum <e...@naggum.net> writes: > * Ed L Cashin <ecas...@uga.edu> > | Could you elaborate on that a bit? I'm interested because it appears > | that your position is that case-sensitivity in identifiers is a Bad > | Thing for programming languages.
> I consider it a bad thing to believe that A is a different character from > a just because it has a certain "presentation property". I mean, we do > not distinguish characters based on font or face, underlining or color, > and most people realize that these are incidental properties. However, > capitalness of a letter is just as incidental: The fact that a letter is > capitalized depending on such randomness as the position of the word in > the sentence is a very strong indicator that "However" and "however" are > not different words, which is effectively what case-sensitive people > think they are.
This is not strictly true in all (natural) languages.
Example 1: German: - no 1-1 correspondence between upper-case and lower-case (there is one letter that only exists in the lower-case set) - some words change class, meaning, and pronunciation when going from one case to the other (example: Weg vs. weg) - case is used (or at least has been -- until it became non-pc in some circles) to put semantic fine points into print (e.g., capitalization of the second person in letters for politeness)
Example 2: Japanese - there is no distinction between upper-case and lower-case at all - HOWEVER: there are still two distinct sets of the phonetic characters called "hiragana" and "katakana". Either one could spell the entire language, but usage of the two sets again depends on things like origin of the word in question, emphasis, style, etc. One could think of katakana as the upper-case version of hiragana. Usage is often analogous, for example one would sometimes find hiragana words spelled in katakana for EMPHASIS. - Written Japanese also uses kanji (Chinese characters), all of which could be spelled either in hiragana or katakana. Unfortunately, the mapping between kanji and hiragana is many-to-many, which shows that the "is the same word" relationship is not an equivalence relation because it is not transitive: "hashi" (chopsticks) and "hashi" (bridge) are spelled exactly the same in hiragana (but are pronounced slightly differently), but the kanji for the respective words are not the same. OTOH, "kyou" and "konnichi" are clearly not the same words when spelled phonetically, but both correspond to the same kanji combination. There are literally thousands of examples for this in Japanese (which does not make it particularly easy to learn :-).
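Both points in this example can be sketched in a few lines of Python. The kana conversion relies on the fact that the Unicode katakana block sits 0x60 code points above the hiragana block; the tiny lexicon is a hypothetical toy containing only the words mentioned above, not a real dictionary:

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KANA_OFFSET = 0x60  # katakana code point = hiragana code point + 0x60

def to_katakana(text):
    """Map hiragana to katakana, leaving every other character untouched."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )

# Hypothetical toy lexicon: kanji spelling -> set of hiragana spellings.
READINGS = {
    "箸": {"はし"},                  # hashi, chopsticks
    "橋": {"はし"},                  # hashi, bridge
    "今日": {"きょう", "こんにち"},  # kyou and konnichi
}

def same_word(kanji, kana):
    """True if the kana string is one accepted spelling of the kanji word."""
    return kana in READINGS[kanji]

# Both kanji are "the same word" as はし, yet they are not the same as each other,
# so the relation cannot be transitive.
chain_holds = same_word("箸", "はし") and same_word("橋", "はし")
print(to_katakana("はし"), chain_holds, "箸" == "橋")
```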
Example 3: English - Speaking of "him" and speaking of "Him" are clearly semantically very different.
Example 4: Mathematics (well, this one is not "natural", after all...) - In the "language of mathematics" we frequently make semantic distinctions between typographically different versions of the "same" character.
Anyway, all I wanted to say was that the distinctions between different versions of a character set are not completely incidental in many (most?) natural languages. I do not want to use this as an argument for or against case-sensitive identifiers in programming languages, since I do not think that programming languages should in any form or manner be modelled after natural ones. (However, I must admit that I personally prefer being able to use mixed case when programming.)
* Matthias Blume <matth...@shimizu-blume.com> | This is not strictly true in all (natural) languages.
All of these arguments indicate that using the capital letter for the sentence-initial word is a very bad design choice for a written language; it violates that strong sense of difference that those who want it to exist focus so strongly on. However, I would argue that the sheer acceptability of destroying the importance of the capital letter in the sentence-initial word cannot be ignored. When I tried to _preserve_ the case of the word despite its position in the sentence, this was regarded as Very Wrong by a bunch of hostile lunatics. This indicated to me that case is _primarily_ incidental, since the intrinsic role can at any time be overridden by the incidental role -- specifically, you have no idea whatsoever what the capitalization of the sentence-initial word would be if it were moved, yet this causes absolutely no problem for anyone.
| Anyway, all I wanted to say was that the distinctions between different | versions of a character set are not completely incidental in many (most?) | natural languages.
In real life, nothing is ever completely anything. People use and abuse case "because it's there". This would not change if capital letters were coded with a "flag" that communicated capitalness. On the contrary, if we had such a flag, the natural development is to have _two_ flags: One for the incidental capital and one for the intrinsic capital. In either case, the display and the coding properties of a character should be separated. You provided an excellent example of this with hiragana and katakana.
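One way to picture the two-flag idea is the following Python sketch. It is purely hypothetical: the class name, the flag names and the rule that a sentence-initial capital is always treated as incidental are assumptions made for illustration, not anything proposed in detail in this thread.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CodedChar:
    base: str                 # the letter itself, stored without case
    intrinsic: bool = False   # capital belongs to the word ("Him", proper nouns)
    incidental: bool = False  # capital caused only by position in the sentence

def encode(word, sentence_initial=False):
    """Encode a word; the caller states whether it starts a sentence.

    Simplification: a sentence-initial capital is always coded as incidental,
    so a proper noun at the start of a sentence would need extra information.
    """
    coded = []
    for i, ch in enumerate(word):
        is_cap = ch.isupper()
        positional = is_cap and sentence_initial and i == 0
        coded.append(CodedChar(ch.lower(), is_cap and not positional, positional))
    return coded

def same_word(a, b):
    """Compare coded words, ignoring only the incidental flag."""
    def key(word):
        return [(c.base, c.intrinsic) for c in word]
    return key(a) == key(b)

def display(word):
    """Render the coded word; either flag produces a capital glyph."""
    return "".join(c.base.upper() if (c.intrinsic or c.incidental) else c.base
                   for c in word)

print(same_word(encode("However", sentence_initial=True), encode("however")))  # True
print(same_word(encode("Him"), encode("him")))                                 # False
```

Under this coding, "However" and "however" compare equal while "Him" and "him" do not, and the display layer is free to render both kinds of capital the same way.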
| I do not want to use this as an argument for or against case-sensitive | identifiers in programming languages, since I do not think that | programming languages should in any form or manner be modelled after | natural ones.
That is not the argument. Please try to understand this. The point is that I have taken the liberty to design the world over again, backing up to _before_ computer geeks coded their character sets, and making a crucial change to the coding of upper-case vs lower-case characters. The names "upper-case" and "lower-case" refer to typographic characteristics, not meaning. Meaning may be coded separately from typography, just as we do in almost every other case.
| (However, I must admit that I personally prefer being able to use mixed | case when programming.)
If it had been more costly for you to achieve this, in terms of "knowing" that you would waste additional space to encode capital letters, would you still have preferred it? I believe, from the reactions to the extended experiment with not randomly upcasing the sentence-initial word, that people would be inclined to accept a coding overhead for that role, as well as for proper nouns, but randomly and liberally sprinkling such overhead throughout identifiers in order to achieve an unnatural visual effect only because it could be done, would most likely not happen. As Common Lisp uses the hyphen to separate words, which would have no higher overhead than embedded capital letters, other languages would have far less inclination to make this horrible mistake, and would therefore not _require_ case-sensitivity.
Whether the programmers would prefer a case-folding or a case-preserving case-insensitivity is an open question, but at least designing languages and coding conventions to use case would not likely happen if case was regarded as just as incidental as color or typeface.
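The difference between the two kinds of case-insensitivity can be sketched with two toy symbol tables in Python. The class names are hypothetical; the folding variant upcases names on entry, as the standard Common Lisp reader does by default, while the preserving variant remembers the first spelling it saw.

```python
class FoldingTable:
    """Case-folding: names are canonicalized on entry; the spelling is lost."""
    def __init__(self):
        self._names = {}

    def intern(self, name):
        canonical = name.upper()
        return self._names.setdefault(canonical, canonical)

class PreservingTable:
    """Case-preserving but insensitive: lookup ignores case, first spelling wins."""
    def __init__(self):
        self._names = {}

    def intern(self, name):
        return self._names.setdefault(name.casefold(), name)

folding = FoldingTable()
preserving = PreservingTable()
print(folding.intern("fooBar"), folding.intern("FOOBAR"))        # FOOBAR FOOBAR
print(preserving.intern("fooBar"), preserving.intern("FOOBAR"))  # fooBar fooBar
```

In both tables the two spellings name the same symbol; they differ only in what gets printed back.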
* Matthias Blume <matth...@shimizu-blume.com> | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
Yeah, me too. Then I could force you to pay attention to the premises that start a discussion instead of completely ignoring the context. Please see <3225942059872...@naggum.net>, and pay particular attention to what Thomas Bushnell wrote.
Sheesh, some people.
Erik Naggum <e...@naggum.net> writes: > * Matthias Blume <matth...@shimizu-blume.com> > | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
> Yeah, me too.
I was under the impression that you thought you already did. :-)
> Then I could force you to pay attention to the premises > that start a discussion instead of completely ignoring the context. > Please see <3225942059872...@naggum.net>, and pay particular attention to > what Thomas Bushnell wrote.
To be frank, I do not care *one bit* about what this discussion was originally about. I was merely commenting on your claim about capitalization being "incidental". The debate about whether or not case-sensitive identifiers in programming languages are Good or Evil, or which character set designs use up more bits than others, etc., bores me.
Matthias Blume <matth...@shimizu-blume.com> writes: > Erik Naggum <e...@naggum.net> writes:
> > * Matthias Blume <matth...@shimizu-blume.com> > > | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
> > Yeah, me too.
> I was under the impression that you thought you already did. :-)
> > Then I could force you to pay attention to the premises > > that start a discussion instead of completely ignoring the context. > > Please see <3225942059872...@naggum.net>, and pay particular attention to > > what Thomas Bushnell wrote.
> To be frank, I do not care *one bit* about what this discussion was > originally about. I was merely commenting on your claim about > capitalization being "incidental". The debate of whether or not > case-sensitive identifiers in programming languages are Good or Evil, > or which character set design use up more bits than others, etc., bore > me.
Capitalization _is_ incidental. It is ceremonially marked in written text, but my impression based on a basic knowledge of linguistics and a casual outside view of German [I don't purport to speak the language] is that German people may claim that "weg" and "Weg" are different words, but the capitalization is not pronounced audibly, so there is generally enough contextual information to disambiguate in speech. Certainly this is the case for English situations like "God loves you." and "The god loves you." These are different words, "God" and "god": one is a proper name and one isn't. But even if they were miscapitalized as "god loves you" or "The God loves you", context would usually still tell you which was meant. It is possible for there to be ambiguity in spite of this in some cases, but it's also possible to have ambiguity with correct case, too. Human language is not precise. But normally where a confusion is common, some audible notation arises to disambiguate. And, incidentally, the audible notation is [to my knowledge] never the addition of the word "uppercase" or "lowercase" because that just isn't the issue in play. It's usually the addition of a guide word, a case marking, a determiner, etc.
> [ ... ] outside view of German [I don't purport to speak the > language] is that German people may claim that "weg" and "Weg" are > different words, but the capitalization is not pronounced audibly,
The two words are pronounced very differently.
> so there is generally enough contextual information to disambiguate in > speech.
Ok, so everything that can be inferred from context is "incidental" then? Most spelling mistakes can be inferred from context, so should we make programming languages tolerate them? (It has been tried, as you know.)
Anyway, this whole debate is supremely silly, IMHO. Fortunately neither you nor Erik get to dictate the rules, at least not for those languages that I speak or program in...
Erik Naggum <e...@naggum.net> writes: > Yeah, me too. Then I could force you to pay attention to the premises > that start a discussion instead of completely ignoring the context. > Please see <3225942059872...@naggum.net>, and pay particular attention to > what Thomas Bushnell wrote.
So, getting back to my original question about charset implementations in Lisp/Scheme (though actually Smalltalk or any such dynamically-typed language will have the same questions and probably the same kinds of solutions), I've done some more study and thinking, so let me try again. My previous question was a tad innocent, it appears, because I was unaware of the great changes that have taken place in Unicode since the last time I read through it and grokked the whole thing (which was back at version 1.2 or something).
I haven't fully internalized the terminology yet, though I'm trying. So please bear with any minor terminological gaffes (and correct them, too).
The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit Unicode values. Part of the reason it does this is so that existing byte streams of Latin-1 characters can (pretty much) be used without modification, and it allows "soft conversion" of existing code, which is quite easy and thus helps everybody switch.
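The "(pretty much)" qualifier matters: only the ASCII subset of Latin-1 is byte-for-byte identical in UTF-8, while the upper half of Latin-1 becomes two bytes per character. A small Python check (the sample strings are arbitrary):

```python
ascii_text = "plain ASCII"
latin1_text = "café"

# ASCII text has the same bytes in both encodings.
ascii_same = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# A Latin-1 character above 0x7F is one byte in Latin-1 but two in UTF-8.
latin1_bytes = latin1_text.encode("latin-1")
utf8_bytes = latin1_text.encode("utf-8")

print(ascii_same)     # True
print(latin1_bytes)   # b'caf\xe9'
print(utf8_bytes)     # b'caf\xc3\xa9'
```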
But I'm thinking about a "design the world over again" kind of strategy. Now Erik is certainly right that capitalization *should* be a combining character kind of thing. So let me stipulate that I want to take Unicode as-is; I get to design *my computer system*, subject to the a priori constraint that Unicode has done a *lot* of work, so I will accept slight deficiencies if they help Unicode work right on the system. So I'll take the existing Unicode encodings, even if they don't do capitals just like we'd want.
But I don't get to redesign existing communications protocols and such; however, that's an externalization issue, and for internal use on the system, such protocols don't matter. Similar comments apply for existing filesystems formats, file conventions, and the like.
Now, I *could* just use UTF-8 internally, but that seems rather foolish. I think it's obvious that characters should be "immediately" represented in pointer values in the way that fixnums are.
Now the Universal Character Set is officially 31 bits, but only 16 bits are in use now, and it is expected that at most 21 bits will be used. So that means it's pretty easy to make sure the whole space of UCS values fits in an immediate representation. That's fine for working with actively used data.
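As a rough sketch of what such an immediate representation could look like, the following Python models a machine word as an integer with a type tag in the low bits. The tag width and tag value are hypothetical choices for illustration; a real Lisp or Scheme implementation would pick its own layout.

```python
TAG_BITS = 3               # hypothetical: low 3 bits of a word carry the type tag
TAG_MASK = (1 << TAG_BITS) - 1
CHAR_TAG = 0b010           # hypothetical tag value meaning "character"
MAX_CODE_POINT = 0x10FFFF  # fits in 21 bits

def box_char(code_point):
    """Pack a code point and the character tag into one word-sized integer."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    return (code_point << TAG_BITS) | CHAR_TAG

def is_char(word):
    """True if the word's tag bits mark it as an immediate character."""
    return word & TAG_MASK == CHAR_TAG

def unbox_char(word):
    """Recover the code point from a tagged word."""
    if not is_char(word):
        raise TypeError("not an immediate character")
    return word >> TAG_BITS

boxed = box_char(ord("λ"))
print(hex(boxed), chr(unbox_char(boxed)))
```

With 21 bits of payload and a few tag bits, even the largest code point stays well inside a 32-bit word.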
However, strings that are going to be kept around a long time should, it seems to me, be stored more compactly. Essentially all strings will be in the Basic Multilingual Plane, so they can fit in 16 bits. That means there would be two underlying string datatypes. I don't think this is a serious problem. Is it worth having a third (for 8-bit characters) so that Latin-1 files don't have to be inflated by a factor of two? It seems to me that this would be important too. Basically then we would have strings which are UCS-4, UCS-2 and Latin-1 restricted (internally, not visibly to users).
So even if strings are "compressed" this way, they are not UTF-8. That's Right Out. They are just direct UCS values. Procedures like string-set! therefore might have to inflate (and thus copy) the entire string if a value outside the range is stored. But that's ok with me; I don't think it's a serious lose.
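The inflate-on-store behaviour described here can be sketched as follows. This is a hypothetical Python model, not a description of any existing Lisp or Scheme implementation: the class name and methods are invented, and the `array` typecode "L" stands in for a 32-bit cell even though it may be wider on some platforms.

```python
from array import array

# Narrowest-first storage choices: (array typecode, largest code point that fits).
_WIDTHS = [("B", 0xFF), ("H", 0xFFFF), ("L", 0x10FFFF)]
_NOMINAL_BITS = (8, 16, 32)

class FlexString:
    """Mutable string stored as direct code point values at the narrowest width."""

    def __init__(self, text=""):
        code_points = [ord(c) for c in text]
        self._level = self._level_for(max(code_points, default=0))
        self._data = array(_WIDTHS[self._level][0], code_points)

    @staticmethod
    def _level_for(code_point):
        for level, (_, limit) in enumerate(_WIDTHS):
            if code_point <= limit:
                return level
        raise ValueError("code point out of range")

    def width_bits(self):
        return _NOMINAL_BITS[self._level]

    def __getitem__(self, index):
        return chr(self._data[index])

    def __setitem__(self, index, char):
        needed = self._level_for(ord(char))
        if needed > self._level:
            # Inflate: copy the whole string into wider cells, as string-set! would.
            self._data = array(_WIDTHS[needed][0], self._data.tolist())
            self._level = needed
        self._data[index] = ord(char)

    def __str__(self):
        return "".join(chr(cp) for cp in self._data)

s = FlexString("abc")
print(s.width_bits())   # 8
s[1] = "λ"              # outside Latin-1, forces the 16-bit representation
print(s.width_bits(), str(s))
```

Indexing stays constant-time at every width, which is the property that storing UTF-8 internally would give up.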
So is this sane?
Ok, then the second question is about combining characters. Level 1 support is really not appropriate here. It would be nice to support Level 3. But perhaps Level 2 with Hangul Jamo characters [are those required for Level 2?] would be good enough.
It seems to me that it's most appropriate to use Normalization Form D. Or is that crazy? It has the advantage of holding all the Level 3 values in a consistent way. (Since precombined characters do not exist for all possibilities, Normalization Form C results in some characters precombined and some not, right?)
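The parenthetical guess about Form C can be checked with Python's standard `unicodedata` module. The sample string is an arbitrary choice: "é" has a precomposed code point, while "x" with an acute accent does not.

```python
import unicodedata

# Precomposed é (U+00E9), then x followed by COMBINING ACUTE ACCENT (U+0301).
text = "\u00e9x\u0301"

nfd = unicodedata.normalize("NFD", text)
nfc = unicodedata.normalize("NFC", text)

def code_points(s):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(nfd))  # ['U+0065', 'U+0301', 'U+0078', 'U+0301']
print(code_points(nfc))  # ['U+00E9', 'U+0078', 'U+0301']
```

Form D decomposes everything uniformly; Form C leaves a mix of precomposed and combining sequences, as suspected.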
And finally, should the Lisp/Scheme "character" data type refer to a single UCS code point, or should it refer to a base character together with all the combining characters that are attached to it?
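To make the two options concrete, here is a Python sketch that groups a string into "base character plus attached combining characters". It is a deliberate simplification that treats every character in a Unicode mark category (Mn, Mc, Me) as attaching to the preceding character; it is not the full Unicode text segmentation algorithm, and the function name is hypothetical.

```python
import unicodedata

def combining_sequences(text):
    """Split text into base characters, each with its trailing combining marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and is_mark:
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)     # start a new base character
    return clusters

sample = "e\u0301a\u0308\u0323z"    # e+acute, a+diaeresis+dot below, z
print(len(sample))                        # 6 code points
print(len(combining_sequences(sample)))   # 3 base-plus-marks units
```

If "character" means a code point, this string has length 6; if it means a base with its combining characters, the length is 3, and each element is itself a variable-length sequence.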
Matthias Blume <matth...@shimizu-blume.com> writes: > Kent M Pitman <pit...@world.std.com> writes:
> > [ ... ] outside view of German [I don't purport to speak the > > language] is that German people may claim that "weg" and "Weg" are > > different words, but the capitalization is not pronounced audibly,
> The two words are pronounced very differently.
> > so there is generally enough contextual information to disambiguate in > > speech.
> Ok, so everything that can be inferred from context is "incidental" > then? Most spelling mistakes can be inferred from context, so should > we make programming languages tolerate them? (It has been tried, as > you know.)
Please read Aristotle on Virtue Ethics. The mean between unreasonable extremes is not something with a fixed answer. The fact that its precise point in design space is not uniquely determined does not mean it should not be something people strive for. If anyone seriously wants to defend spelling errors as a good design theory, we could have a discussion about it. Otherwise, it's a pointless red herring. I do, however, contend a theory behind the point of view CL has, and was merely describing that point of view.
> Anyway, this whole debate is supremely silly, IMHO. Fortunately > neither you nor Erik get to dictate the rules, at least not for those > languages that I speak or program in...
We aren't dictating rules, and I personally don't really appreciate this attempt to recast my defense of an arbitrary but reasonable design choice into some sort of ignorant attempt to control the world.
All we have done is to try to explain the present state of affairs based on an attempt for harmony with something people do with a great deal of statistical regularity. Probably there is no deed that everyone does with any predictability other than, as they say, death and taxes, but it seems inappropriate to base design on the idea that this implies no other large scale regularities worth checking into...
> Please read Aristotle on Virtue Ethics. The mean between unreasonable > extremes is not something with a fixed answer.
It can also only be determined by the man with a particular virtue known as "practical wisdom", as well. And, with practical wisdom, comes all the virtues, not just one or two. Which means that only the person with true virtue is even able to tell what the Right Thing to do is.
Aristotle's talk of a "mean" is a metaphor, of course. It's some kind of balance, some kind of "just enough" notion.
Some medievals liked to pooh-pooh this by taking it overliterally, with a rather snide attack. Thomas Aquinas, however, liked the "mean" theory, and here's how he treats of the snide attackers (from the "Quaestio disputata de virtutibus in communi", Article 13, Objection 7 and the response):
Whether virtue lies in a mean. It seems not....Boethius in "On arithmetic" speaks of a threefold mean, the arithmetical, as 6 between 4 and 8 which is an equal distance from both, and the geometrical, as 6 between 9 and 4, which is proportionally the same distance from both, and the harmonic or musical mean, as 3 between 6 and 2 because there is the same proportion of one extreme to the other, namely, 3 (which is the difference between 6 and 3) to 1 which is the difference between 2 and 3. But none of these means is found in virtue, since the mean of virtue does not relate equally to extremes, nor in a quantitative way nor according to some proportion of the extremes and differences. Therefore, virtue does not lie in the mean.
[replies Thomas]: It should be said that the means spoken of by Boethius lie in things and thus are not relevant to the mean of virtue which is determined by reason. Justice seems to be an exception since it involves both a mean in things and another according to reason: The arithmetical mean is relevant to exchange and the geometrical to distribution, as is clear from [Aristotle's Nicomachean] Ethics [book] 5.
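The three means named in the objection quoted above can be checked with exact arithmetic. A short Python sketch (the function names are mine, not Boethius's):

```python
from fractions import Fraction

def arithmetic_mean(a, b):
    """The value equally distant from both extremes."""
    return Fraction(a + b, 2)

def harmonic_mean(a, b):
    """The value m for which (a - m) : (m - b) equals a : b."""
    return Fraction(2 * a * b, a + b)

def is_geometric_mean(m, a, b):
    """Exact test that a : m equals m : b, avoiding floating-point roots."""
    return m * m == a * b

print(arithmetic_mean(4, 8))        # 6
print(is_geometric_mean(6, 9, 4))   # True, since 9:6 = 6:4
print(harmonic_mean(6, 2))          # 3

# The harmonic property: the extremes 6 : 2 stand in the same ratio
# as the differences (6 - 3) : (3 - 2), namely 3 : 1.
print(Fraction(6, 2) == Fraction(6 - 3, 3 - 2))   # True
```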
Anyway, I'd recommend the Nicomachean Ethics of Aristotle to anyone interested in thinking. You'll find it aggravating; he's quite unmodern and actually quite bogus in a lot of ways, but he is truly important and it will change a great deal about how you think, if you take it seriously.
Erik Naggum <e...@naggum.net> wrote in message <news:3226054464281011@naggum.net>... > ... but at least designing languages > and coding conventions to use case would not likely happen if case was > regarded as just as incidental as color or typeface.
OTOH, if terminals had gotten color and typefaces earlier, maybe programming languages would have evolved to use them. Maybe give each namespace its own color, so you would specify the value of a name by putting it in blue, the function by using red, keywords in italics, macros in green. The mind boggles at the possibilities. In fact, if you want to boggle your mind, see
> We aren't dictating rules, and I personally don't really appreciate this > attempt to recast my defense of an arbitrary but reasonable design choice > into some sort of attempt at an ignorant attempt to control the world.
Sorry, I was unreasonably harsh on you, Kent.
> All we have done is to try to explain the present state of affairs based > on an attempt for harmony with something people do with a great deal of > statistical regularity.
As I have tried to point out, this sort of regularity isn't actually quite as regular as some try to make it. The Japanese language is a great example (although there the distinction is not called "uppercase vs. lowercase").
By the way, here is an example in a case-sensitive natural language where the distinction between uppercase and lowercase gets *pronounced*: "mit" vs. "MIT" in German. The first means "with" and is pronounced like "mitt", the second is the Massachusetts Institute of Technology and is pronounced like speakers of English would pronounce it: em-ay-tee. I think that there are enough examples of this around so that making a distinction between uppercase and lowercase is warranted in the natural language case. Again, I do not think that this needs to be in any way correlated with the PL case.
> Capitalization _is_ incidental. It is ceremonially marked in written > text, but my impression based on a basic knowledge of linguistics and > a casual outside view of German [I don't purport to speak the > language] is that German people may claim that "weg" and "Weg" are > different words, but the capitalization is not pronounced audibly, so > there is generally enough contextual information to disambiguate in > speech.
Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a long 'e' and the other with a short one - that is because they are different words. Should you incidentally start a sentence with 'weg', thus writing it with capital 'W' it would still be pronounced like 'weg'. This might be difficult to understand, but that is how natural languages are, I guess.
Andreas -- Wherever I lay my .emacs, there´s my $HOME.