
Wide character implementation


Thomas Bushnell, BSG

unread,
Mar 19, 2002, 12:08:15 AM3/19/02
to

If one uses tagged pointers, then it's easy to implement fixnums as
ASCII characters efficiently.

But suppose one wants to have the character datatype be 32-bit Unicode
characters? Or worse yet, 35-bit Unicode characters?

At the same time, most characters in the system will of course not be
wide. What are the sane implementation strategies for this?

Frode Vatvedt Fjeld

unread,
Mar 19, 2002, 4:08:59 AM3/19/02
to
tb+u...@becket.net (Thomas Bushnell, BSG) writes:

> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.

Hm.. perhaps you mean it's easy to implement characters as immediate
values?

> But suppose one wants to have the character datatype be 32-bit
> Unicode characters? Or worse yet, 35-bit Unicode characters?
>
> At the same time, most characters in the system will of course not
> be wide. What are the sane implementation strategies for this?

I suppose one would assign "most characters in the system" to a sub-type of
the wide characters, and implement that sub-type as immediates.

--
Frode Vatvedt Fjeld

Pierpaolo BERNARDI

unread,
Mar 19, 2002, 5:22:05 AM3/19/02
to

"Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto nel messaggio
news:87wuw92...@becket.becket.net...

>
> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.
>
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters? Or worse yet, 35-bit Unicode characters?

21 bits are enough for Unicode.

P.


Erik Naggum

unread,
Mar 19, 2002, 5:53:48 AM3/19/02
to
* Thomas Bushnell, BSG

| If one uses tagged pointers, then its easy to implement fixnums as
| ASCII characters efficiently.

Huh? No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters? Or worse yet, 35-bit Unicode characters?

Unicode is a 31-bit character set. The base multilingual plane is 16
bits wide, and then there is the possibility of 20 more bits encoded in
two 16-bit values, each with values from 0 to 1023, effectively (+ (expt 2 20)
(- (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
but one does not have to understand the lo- and hi-word codes that make
up the 20-bit character space. In effect, you need 16 bits. Therefore,
you could represent characters with the following bit pattern, with b for
bits and c for code. Fonts are a mistake, so that is removed.

000000ccccccccccccccccccccc00110

This is useful when the fixnum type tag is either 000 for even fixnums
or 100 for odd fixnums, effectively 00 for fixnums. This makes
char-code and code-char a single shift operation. Of course, char-bits
and char-font are not supported in this scheme, but if you _really_ have
to, the upper 4 bits may be used for char-bits.
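
A rough sketch of that arithmetic in Common Lisp, treating the tagged word
as a plain integer (the constant and function names are made up for
illustration, not taken from the post):

  (defconstant +char-tag+      #b00110) ; low five bits of a character word
  (defconstant +char-tag-bits+ 5)
  (defconstant +fixnum-shift+  2)       ; fixnum tag is the low bits 00

  (defun raw-char-code (char-word)
    ;; One arithmetic right shift by 3: the #b00110 tag bits fall off and
    ;; the code comes out already tagged as a fixnum (low bits 00).
    (ash char-word (- +fixnum-shift+ +char-tag-bits+)))

  (defun raw-code-char (fixnum-word)
    ;; The inverse: shift left by 3 and OR the character tag back in.
    (logior (ash fixnum-word (- +char-tag-bits+ +fixnum-shift+)) +char-tag+))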

| At the same time, most characters in the system will of course not be
| wide. What are the sane implementation strategies for this?

I would (again) recommend actually reading the specification. The
character type can handle everything, but base-char could handle the
8-bit things that reasonable people use. The normal string type has
character elements while base-string has base-char elements. It would
seem fairly reasonable to implement a *read-default-string-type* that
would take string or base-string as value if you choose to implement both
string types.
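
For readers unfamiliar with the two types, a minimal illustration (standard
Common Lisp; the exact upgraded element types are implementation-dependent):

  (subtypep 'base-char 'character)           ; => T, T
  (let ((thin (make-string 4 :element-type 'base-char))
        (wide (make-string 4 :element-type 'character)))
    (list (typep thin 'base-string)          ; typically T
          (array-element-type wide)))        ; e.g. CHARACTER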

///
--
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.

Janis Dzerins

unread,
Mar 19, 2002, 6:31:52 AM3/19/02
to
"Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

What "Unicode"?

--
Janis Dzerins

Eat shit -- billions of flies can't be wrong.

Pierpaolo BERNARDI

unread,
Mar 19, 2002, 9:51:38 AM3/19/02
to

"Janis Dzerins" <jo...@latnet.lv> ha scritto nel messaggio
news:87d6y0z...@asaka.latnet.lv...

> "Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:
>
> > "Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto nel messaggio
> > news:87wuw92...@becket.becket.net...
> > >
> > > If one uses tagged pointers, then its easy to implement fixnums as
> > > ASCII characters efficiently.
> > >
> > > But suppose one wants to have the character datatype be 32-bit Unicode
> > > characters? Or worse yet, 35-bit Unicode characters?
> >
> > 21 bits are enough for Unicode.
>
> What "Unicode"?

The character encoding standard defined by the Unicode Consortium, Inc.
Are there other Unicodes?

P.

Sander Vesik

unread,
Mar 19, 2002, 11:22:30 AM3/19/02
to
In comp.lang.scheme Thomas Bushnell, BSG <tb+u...@becket.net> wrote:
>
> If one uses tagged pointers, then its easy to implement fixnums as
> ASCII characters efficiently.
>
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters? Or worse yet, 35-bit Unicode characters?

They use either UTF8 or UTF16 - you cannot rely on whatever size
you pick to be suitably long forever, unicode is sort of inherently
variable-length (characters even have two possible representations
in many cases, &auml; and similar 8-)

>
> At the same time, most characters in the system will of course not be
> wide. What are the sane implementation strategies for this?
>

Implement them as variable-length strings using say UTF-8. Also, saying that
most characters will not be wide may well be a wrong assumption 8-)
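
For concreteness, this is roughly what the variable-length encoding looks
like for a single code point (an illustrative Common Lisp sketch, not part
of the original post):

  (defun utf-8-octets (code)
    ;; Encode one code point as a list of one to four octets.
    (cond ((< code #x80)    (list code))
          ((< code #x800)   (list (logior #xC0 (ash code -6))
                                  (logior #x80 (ldb (byte 6 0) code))))
          ((< code #x10000) (list (logior #xE0 (ash code -12))
                                  (logior #x80 (ldb (byte 6 6) code))
                                  (logior #x80 (ldb (byte 6 0) code))))
          (t                (list (logior #xF0 (ash code -18))
                                  (logior #x80 (ldb (byte 6 12) code))
                                  (logior #x80 (ldb (byte 6 6) code))
                                  (logior #x80 (ldb (byte 6 0) code))))))

  ;; (utf-8-octets #x61)    => (97)              ; "a" is one octet
  ;; (utf-8-octets #xE4)    => (195 164)         ; a-diaeresis is two
  ;; (utf-8-octets #x10400) => (240 144 144 128) ; outside the BMP, four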

--
Sander

+++ Out of cheese error +++

Sander Vesik

unread,
Mar 19, 2002, 11:27:04 AM3/19/02
to
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | If one uses tagged pointers, then its easy to implement fixnums as
> | ASCII characters efficiently.
>
> Huh? No sense this makes.
>
> | But suppose one wants to have the character datatype be 32-bit Unicode
> | characters? Or worse yet, 35-bit Unicode characters?
>
> Unicode is a 31-bit character set. The base multilingual plane is 16
> bits wide, and then there are the possibility of 20 bits encoded in two
> 16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
> (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
> but one does not have to understand the lo- and hi-word codes that make
> up the 20-bit character space. In effect, you need 16 bits. Therefore,
> you could represent characters with the following bit pattern, with b for
> bits and c for code. Fonts are a mistake, so is removed.
>
> 000000ccccccccccccccccccccc00110

I don't think this is true any more as of Unicode 3.1; afaik, 16 bits is
no longer enough.

[snip - this doesn't sound like scheme]

Ben Goetter

unread,
Mar 19, 2002, 11:46:41 AM3/19/02
to
Quoth Pierpaolo BERNARDI:

> "Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto
> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters? Or worse yet, 35-bit Unicode characters?
>
> 21 bits are enough for Unicode.

And ISO 10646, per working group resolution.

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2175.htm
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2225.doc

lin8080

unread,
Mar 19, 2002, 1:45:18 PM3/19/02
to
Janis Dzerins wrote:

> "Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

> > "Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto nel messaggio
> > news:87wuw92...@becket.becket.net...

> > 21 bits are enough for Unicode.
>
> What "Unicode"?

Try:

http://www.linuxdoc.org/HOWTO/Unicode-HOWTO.html

http://www.cl.cam.ac.uk/~mgk25/unicode.html


stefan

Thomas Bushnell, BSG

unread,
Mar 19, 2002, 5:33:34 PM3/19/02
to
"Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

> 21 bits are enough for Unicode.

Um, Unicode version 3.1.1 has the following as the largest character:

E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;

Now the Unicode space isn't sparse, but I don't think compressing the
space is the most efficient strategy.

Erik Naggum

unread,
Mar 19, 2002, 6:15:12 PM3/19/02
to
* Janis Dzerins <jo...@latnet.lv>
| What "Unicode"?

unicode.org

Erik Naggum

unread,
Mar 19, 2002, 6:18:22 PM3/19/02
to
* Sander Vesik <san...@haldjas.folklore.ee>

| I don't think this is true any more as of unicode 3.1 afaik, 16 bits is
| no longer enough.

Please pay attention and actually make an effort to read what you respond
to, will you? You should also be able to count the number of c bits and
arrive at a number greater than 16 if you do not get lost on the way.

Sheesh, some people.

Erik Naggum

unread,
Mar 19, 2002, 6:22:39 PM3/19/02
to
* Sander Vesik <san...@haldjas.folklore.ee>

| They use either UTF8 or UTF16 - you cannot rely on whetvere size
| you pick to be suitably long forever, unicode is sort of inherently
| variable-length (characters even have too possible representations
| in many cases, &auml; and similar 8-)

Variable-length characters? What the hell are you talking about? UTF-8
is a variable-length _encoding_ of characters that most certainly are
intended to require a fixed number of bits. That is, unless you think
the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
the symbol ä takes up 8. Then you have variable-length characters. Few
people consider this a meaningful way of talking about variable length.

| Implement them as variable-length strings using say UTF-8. Also, saying
| that most characters will not be wide may well be a wrong assumptin 8-)

Real programming languages work with real character objects, not just
UTF-8-encoded strings in memory.

Acquire clue, _then_ post, OK?

Tim Moore

unread,
Mar 19, 2002, 6:32:19 PM3/19/02
to
On 19 Mar 2002 14:33:34 -0800, Thomas Bushnell, BSG <tb+u...@becket.net>
wrote:

Um, what's your point? E007f fits in 20 bits. If you're thinking
that's all that's needed, there are private use areas (E000..F8FF,
F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21
bits looks right.
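
The bit counts are easy to verify (a quick Common Lisp aside, added for
reference):

  (integer-length #xE007F)   ; => 20
  (integer-length #x10FFFD)  ; => 21  (top of the last private use area)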

Tim

Thomas Bushnell, BSG

unread,
Mar 19, 2002, 6:46:51 PM3/19/02
to
tmo...@sea-tmoore-l.dotcast.com (Tim Moore) writes:

> Um, what's your point? E007f fits in 20 bits. If you're thinking
> that's all that's needed, there are private use areas (E000..F8FF,
> F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21
> bits looks right.

Oh what an embarrassing brain fart, yes that's quite right. I don't
know what I was counting, but my head was clearly on backwards.

David Rush

unread,
Mar 20, 2002, 3:42:52 AM3/20/02
to
Erik Naggum <er...@naggum.net> writes:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whetvere size
> | you pick to be suitably long forever, unicode is sort of inherently
> | variable-length (characters even have too possible representations
> | in many cases, &auml; and similar 8-)
>
> Variable-length characters? What the hell are you talking about? UTF-8
> is a variable-length _encoding_ of characters that most certainly are
> intended to require a fixed number of bits. That is, unless you think
> the digit 3 take up only 6 bits while the letter A takes up 7 bits and
> the symbol ä takes up 8. Then you have variable-length characters. Few
> people consider this a meaningful way of talking about variable length.

Erik, this is beneath you. Surely you know that Octet != Character.

> Acquire clue, _then_ post, OK?

In context, rather pathetic, this seems...

david rush
--
The important thing is victory, not persistence.
-- the Silicon Valley Tarot

Pekka P. Pirinen

unread,
Mar 20, 2002, 11:20:00 AM3/20/02
to
[comp.lang.lisp only]

Erik Naggum <er...@naggum.net> writes:
> * Thomas Bushnell, BSG
> | At the same time, most characters in the system will of course not be
> | wide. What are the sane implementation strategies for this?
>
> [...] The normal string type has character elements while
> base-string has base-char elements. It would seem fairly
> reasonable to implement a *read-default-string-type* that would
> take string or base-string as value if you choose to implement
> both string types.

Yes, that's basically it.

In actual fact, Liquid and Lispworks have
*DEFAULT-CHARACTER-ELEMENT-TYPE* for various functions taking an
:ELEMENT-TYPE argument, and other similar needs. See
<http://www.xanalys.com/software_tools/reference/lwl42/LWRM-U/html/lwref-u-198.htm#pgfId-1008739>.
Although the doc doesn't say it (there's a lot of unpublished doc on
fat characters), LW:*DEFAULT-CHARACTER-ELEMENT-TYPE* also controls
what kind of strings the reader constructs from the "" syntax.
However, if characters of larger types are seen by the string reader,
a string that can hold these characters is constructed without
complaint.

(This also avoids any confusion from STRING being a supertype of
BASE-STRING.)

Note that it is the programmer's responsibility to choose and declare
suitable character and string types, if they want to write a program
that works efficiently with both BASE-CHAR and larger character sets.
The implementation cannot possibly know enough to make the right
choices. It can only offer a selection of types and interfaces to
control the types for each language feature.
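
A minimal example of the kind of declaration meant here (whether it buys
any speed or space is up to the implementation):

  (defun count-spaces (s)
    (declare (type simple-base-string s))  ; promise: thin characters only
    (count #\Space s))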
--
Pekka P. Pirinen, Global Graphics Software Limited
In cyberspace, everybody can hear you scream. - Gary Lewandowski

Ray Dillinger

unread,
Mar 20, 2002, 5:29:16 PM3/20/02
to

I'd have a fixed-width internal representation -- probably 32 bits
although that's overkilling it by about a byte and a half, probably
identical to some mapping of the unicode character set -- and then
use i/o functions that were character-set aware and could translate
to and from various character sets and representations.

I wouldn't want to muck about internally with a format that had
characters of various different widths: too much pain to implement,
too many chances to introduce bugs, not enough space savings.
Besides, when people read whole files as strings, do you really
want to run through the whole string counting multi-byte characters
and single-byte characters to find the value of an expression like

(string-ref FOO charcount) ;; lookups in a 32 million character string!

where charcount is large? I don't. Constant width means O(1) lookup
time.
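
A sketch of why the variable-width lookup is linear, assuming the string is
kept as a vector of UTF-8 octets (the function is illustrative, not from
the post):

  (defun utf-8-char-start (octets n)
    ;; Return the octet index where the Nth (zero-based) character starts.
    ;; Continuation octets look like #b10xxxxxx, so finding character N
    ;; means scanning past everything before it -- O(n), not O(1).
    (loop with i = 0
          repeat n
          do (incf i)
             (loop while (and (< i (length octets))
                              (= (ldb (byte 2 6) (aref octets i)) #b10))
                   do (incf i))
          finally (return i)))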

If space is limited, or if you're doing very serious performance
tuning, You might want to have two separate constant-width internal
character representations, one for short characters (ascii or 16bit)
and one for long (full unicode). But if so, you're going to have to
take into account the extra space that will be used by the
additional executable code in your character and string comparisons
and manipulation functions, and deal with the increased complexity
there. That would introduce some mild insanity and chances for a few
bugs, but imo it's not as bad as variable-width characters.

What is sane, however, depends deeply on what environment you expect
to be in. You have to ask yourself whether the scheme you're writing
will be used with data in multiple character sets.

For example, will users want to read strings in ebcdic and write
them in unicode? How about the multiple incompatible versions of
ebcdic? Do you have to support them, or can we let them die now?
Will your implementation want to read and produce both UTF-8 and
UTF-16 output? Will you have to handle miscellaneous ISO character
sets that have different characters mapped to the same character
codes above 127? Or obsolete ascii where the character code we
use as backslash used to mean 1/8? How about five-bit Baudot
coding? :-)

Get character i/o functions that do translation, and then the
lookups and references and compares and everything just work for
free with simple code, and all you have to do to support a new
character set is to provide a new mapping that the i/o functions
can use.
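
In sketch form, with made-up names (one decoder per external character set,
each producing an ordinary string):

  (defun decode-latin-1 (octets)
    ;; Latin-1 octet values map directly onto the first 256 code points.
    (map 'string #'code-char octets))

  (defun decode-octets (octets external-format)
    ;; Dispatch on the external format; further formats plug in here.
    (ecase external-format
      ((:latin-1 :ascii) (decode-latin-1 octets))))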

Andy Heninger

unread,
Mar 21, 2002, 1:53:06 AM3/21/02
to
"Ray Dillinger" <be...@sonic.net> wrote

> Get character i/o functions that do translation, and then the
> lookups and references and compares and everything just work for
> free with simple code, and all you have to do to support a new
> character set is to provide a new mapping that the i/o functions
> can use.

If you want to provide full up international support, the code for string
manipulation becomes anything but simple, no matter what your string
representation. Think string compares that respect the cultural conventions
of different countries and languages (collation), for example. And if
you're thinking Unicode, this is the direction you're headed.

See IBM's open source Unicode library for a good example of what's
involved -
http://oss.software.ibm.com/icu

-- Andy Heninger
heni...@us.ibm.com

Erik Naggum

unread,
Mar 21, 2002, 5:14:25 AM3/21/02
to
* Pekka P. Pirinen

| Note that it is the programmer's responsibility to choose and declare
| suitable character and string types, if they want to write a program
| that works efficiently with both BASE-CHAR and larger character sets.

If they want that, they should always use the types string and character.
Only if the programmer knows that he creates base-string and works with
base-char objects, should he so declare them. Since string is carefully
worded to be a collection of types, an implementation that declares
strings exclusively will work for all subtypes of string.

Erik Naggum

unread,
Mar 21, 2002, 5:15:47 AM3/21/02
to
* David Rush <ku...@bellsouth.net>

| Erik, this is beneath you. Surely you know that Octet != Character.

If you think this is about octets, you are retarded and proud of it.

| > Acquire clue, _then_ post, OK?
|
| In context, rather pathetic, this seems...

Learn of what you speak, _then_ become a snotty asshole, OK?

Ray Dillinger

unread,
Mar 21, 2002, 11:21:57 AM3/21/02
to
Andy Heninger wrote:
>
> "Ray Dillinger" <be...@sonic.net> wrote

>
> If you want to provide full up international support, the code for string
> manipulatioin becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.

I dunno. As implementor I want to make it *possible* to
implement all the complications. I want to take the major
barriers out of the way and deal with encodings intelligently.
I'm willing to leave presentation and non-default collation
to the authors of language packages. Let someone who knows
and cares implement that as a library; I want to provide the
foundation stones so that she can, and provide default
semantics on anonymous characters (which, to me, includes
anything outside of the latin, european, extended latin,
and math planes) that are logical, consistent, and overridable.

Should the REPL rearrange itself to go top-char-to-bottom,
right-column-to-left, with prompts appearing at the top,
if someone has named their variables and defined their
symbols with kanji characters instead of latin? It's an
interesting thought. Should program code go in boustrophedon
(alternating left-to-right in rows from top down) if someone
has named stuff using hieroglyphics? Um, maybe.... But is
the scheme system really where that kind of support is
needed, or would it just confuse people? And what's the
indentation convention for boustrophedon?

Maybe that last byte-and-a-half should be used for left-right
and up-down and spacing properties and the scheme system itself
ought to do all that stuff. But it's not so important I'm
going to implement it before, say, read-write invariance on
procedure objects.

Bear

Duane Rettig

unread,
Mar 21, 2002, 1:00:01 PM3/21/02
to
"Andy Heninger" <an...@jtcsv.com> writes:

> "Ray Dillinger" <be...@sonic.net> wrote
> > Get character i/o functions that do translation, and then the
> > lookups and references and compares and everything just work for
> > free with simple code, and all you have to do to support a new
> > character set is to provide a new mapping that the i/o functions
> > can use.

Even before our current version of Allegro CL (6.1), we were
supporting external-formats to exactly that extent, and it has
been extendible (for the most part). See

http://www.franz.com/support/documentation/6.0/doc/iacl.htm#locales-1

> If you want to provide full up international support, the code for string
> manipulatioin becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.
>
> See IBM's open source Unicode library for a good example of what's
> involved -
> http://oss.software.ibm.com/icu

We incorporate a large amount of IBM's work (and other work, as well)
in our current localization support. See

http://www.franz.com/support/documentation/6.1/doc/iacl.htm#localization-1

Note that we have chosen not to support LC_CTYPE and LC_MESSAGES at this time.
Also, LC_COLLATE is not supported for 6.1, but Unicode Collation Element
Tables (UCETs) will be supported for 6.2.

--
Duane Rettig Franz Inc. http://www.franz.com/ (www)
1995 University Ave Suite 275 Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)

Sander Vesik

unread,
Mar 22, 2002, 4:13:12 PM3/22/02
to
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF8 or UTF16 - you cannot rely on whetvere size
> | you pick to be suitably long forever, unicode is sort of inherently
> | variable-length (characters even have too possible representations
> | in many cases, &auml; and similar 8-)
>
> Variable-length characters? What the hell are you talking about? UTF-8
> is a variable-length _encoding_ of characters that most certainly are
> intended to require a fixed number of bits. That is, unless you think
> the digit 3 take up only 6 bits while the letter A takes up 7 bits and
> the symbol ä takes up 8. Then you have variable-length characters. Few

> people consider this a meaningful way of talking about variable length.

Wake up, smell the coffee and learn about 'combiners'. And then *think*
just a little bit, including about things like collation, sort order
and similar.

>
> ///

Erik Naggum

unread,
Mar 22, 2002, 10:03:52 PM3/22/02
to
* Sander Vesik

| Wake up, smnell the coffee and learn about 'combiners'. And then *think*
| just a little bit, including about thinks like collation, sort order and
| similar.

Perhaps you are unaware of the character concept as used in Unicode? It
would seem prudent at this time for you to return to the sources and
obtain the information you lack. To wit, what you incompetently refer to
as "combiners" are actually called "combining characters". I suspect you
knew that, too, since nobody _else_ calls them "combiners". But it seems
that you are fighting for your honor, now, not technical correctness, and
I shall leave to you another pathetic attempt to feel good about yourself
when you should acknowledge inferior knowledge and learn something.

Oh, by the way, Unicode has three levels. Study Unicode, and you will
know what they mean and what they do. Hint: "variable-length character"
is an incompetent restatement. A single _glyph_ may be made up of more
than one _character_ and a given glyph may be specified using more than
one character. If you had known Unicode at all, you would know this.

Sander Vesik

unread,
Mar 23, 2002, 1:51:39 PM3/23/02
to
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Sander Vesik
> | Wake up, smnell the coffee and learn about 'combiners'. And then *think*
> | just a little bit, including about thinks like collation, sort order and
> | similar.
>
> Perhaps you are unaware of the character concept as used in Unicode? It
> would seem prudent at this time for you to return to the sources and
> obtain the information you lack. To wit, what you incompetently refer to
> as "combiners" are actually called "combining characters". I suspect you
> knew that, too, since nobody _else_ calls them "combiners". But it seems
> that you are fighting for your honor, now, not technical correctness, and
> I shall leave to you another pathetic attempt to feel good about yourself
> when you should acknowledge inferior knowledge and learn something.

I don't subscribe to the concept of honour. I also couldn't care less what
you think of me.

>
> Oh, by the way, Unicode has three levels. Study Unicode, and you will
> know that they mean and what they do. Hint: "variable-length character"
> is an incompetent restatement. A single _glyph_ may be made up of more
> than one _character_ and a given glyph may be specifed using more than
> one character. If you had known Unicode at all, you would know this.

It is pointless to think of glyphs in any other way than as characters - it
should not make any difference whether a-diaeresis is represented by one code
point - the precombined one - or two. In fact, if there is a detectable
difference from anything dealing with text strings the implementation is
demonstrably broken.

Erik Naggum

unread,
Mar 23, 2002, 8:46:30 PM3/23/02
to
* Sander Vesik

| I also couldn't care less what you think of me.

You should realize that only people who care a lot, make this point.

| It is pointless to think of glyph in any other way than characters - it
| should not make any difference whetever adiaresis is represented by one
| code point - the precombined one - or two. In fact, if there is a
| detctable difference from anything dealing with text strings the
| implementation is demonstratably broken.

It took the character set community many years to figure out the crucial
conceptual and then practical difference between the "characteristic
glyph" of a character and the character itself, namly that a character
may have more than one glyph, and a glyph may represent more than one
character. If you work with characters as if they were glyphs, you
_will_ lose, and you make just the kind of arguments that were made by
people who did _not_ grasp this difference in the ISO committees back in
1992 and who directly or indirectly caused Unicode to win over the
original ISO 10646 design. Unicode has many concessions to those who
think character sets are also glyph sets, such as the presentation forms,
but that only means that there are different times you would use
different parts of the Unicode code space. Some people who try to use
Unicode completely miss this point.

It also took some _companies_ a really long time to figure out the difference
between glyph sets and character sets. (E.g., Apple and Xerox, and, of
course, Microsoft has yet to reinvent the distinction badly in the name
of "innovation", so their ISO 8859-1-like joke violates important rules
for character sets.) I see that you are still in the pre-enlightenment
state of mind and have failed to grasp what Unicode does with its three
levels. I cannot help you, since you appear to stop thinking in order to
protect or defend yourself or whatever (it sure looks like some mideast
"honor" codex to me), but if you just pick up the standard and read its
excellent introductions or even Unicode: A Primer, by Tony Graham, you
will understand a lot more. It does an excellent job of explaining the
distinction between glyph and character. I think you need it much more
than trying to defend yourself by insulting me with your ignorance.

Now, if you want to use or not use combining characters, you make an
effort to convert your input to your preferred form before you start
processing. This isolates the "problem" to a well-defined interface, and
it is no longer a problem in properly designed systems. If you plan to
compare a string with combining characters with one without them, you are
already so confused that there is no point in trying to tell you how
useless this is. This means that thinking in terms of "variable-length
characters" is prima facie evidence of a serious lack of insight _and_ an
attitude problem that something somebody else has done is wrong and that
you know better than everybody else. Neither are problems with Unicode.

Thomas Bushnell, BSG

unread,
Mar 23, 2002, 11:25:49 PM3/23/02
to

So a secondary question; if one is designing a new Common Lisp or
Scheme system, and one is not encumbered by any requirements about
being consistent with existing code, existing operating systems, or
existing communications protocols and interchange formats: that is, if
one gets to design the world over again:

Should the Scheme/CL type "character" hold Unicode characters, or
Unicode glyphs? (It seems clear to me that it should hold characters,
but I might be thinking about it poorly.)

And, whichever answer, why is that the right answer?

Thomas

cr88192

unread,
Mar 23, 2002, 9:02:30 PM3/23/02
to
>
> Should the Scheme/CL type "character" hold Unicode characters, or
> Unicode glyphs? (It seems clear to me that it should hold characters,
> but I might be thinking about it poorly.)
>
> And, whichever answer, why is that the right answer?
>
one could use "the cheap man's unicode" or utf-8.
actually personally I don't care so much about unicode and have held it in
the "possibly later" respect. for now it is not terribly important as I can
just restrict myself to the lower 128 characters.
in any case it sounds simpler to implement than the "codepage" system, so I
will probably use it.

"ich bin einen Amerikaner, und ich tun nicht erweiterter Zeichen noetig"
(don't mind bad grammar, as I don't really know german...).

nevermind...

Erik Naggum

unread,
Mar 24, 2002, 1:51:53 AM3/24/02
to
* tb+u...@becket.net (Thomas Bushnell, BSG)

| Should the Scheme/CL type "character" hold Unicode characters, or
| Unicode glyphs? (It seems clear to me that it should hold characters,
| but I might be thinking about it poorly.)

There are no Unicode glyphs. This properly refers to the equivalence of
a sequence of characters starting with a base character and optionally
followed by combining characters, and "precomposed" characters. This is the
canonical-equivalence of character sequences. A processor of Unicode
text is allowed to replace any character sequence with any of its
canonically-equivalent character sequences. It is in this regard that an
application may want to request a particular composite character either
as one character or a character sequence, and may decide to examine each
coded character element individually or as an interpreted character.
These constitute three different levels of interpretation that it must be
possible to specify. Since an application is explicitly permitted to
choose any of the canonical-equivalent character sequences for a
character, the only reasonable approach is to normalize characters into a
known internal form.
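
As a sketch of that normalization step (the composition table below holds a
single hand-written entry; a real one would be generated from the Unicode
data files):

  (defparameter *canonical-compositions*
    (let ((table (make-hash-table :test #'equal)))
      ;; U+0061 a + U+0308 combining diaeresis => U+00E4 a-diaeresis
      (setf (gethash (cons #x0061 #x0308) table) #x00E4)
      table))

  (defun compose-codepoints (codepoints)
    ;; Replace base+combining pairs with their precomposed equivalents
    ;; where the table has one; pass everything else through unchanged.
    (let ((result '()))
      (dolist (cp codepoints (nreverse result))
        (let ((composed (and result
                             (gethash (cons (first result) cp)
                                      *canonical-compositions*))))
          (if composed
              (setf (first result) composed)
              (push cp result))))))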

There is one crucial restriction on the ability to use equivalent
character sequences. ISO 10646 defines implementation levels 1, 2 and 3
that, respectively, prohibit all combining characters, allow most
combining characters, and allow all combining characters. This is a very
important part of the whole Unicode effort, but Unicode has elected to
refer to ISO 10646 for this, instead of adopting it. From my personal
communication with high-ranking officials in the Unicode consortium, this
is a political decision, not a technical one, because it was feared that
implementors that would be happy with trivial character-to-glyph--mapping
software (such as a conflation of character and glyph concepts and fonts
that support this conflation), especially in the Latin script cultures,
would simply drop support for the more complex usage of the Latin script
and would fail to implement e.g., Greek properly. Far from being an
enabling technology, it was feared that implementing the full set of
equivalences would be omitted and thus not enable the international
support that was so sought after. ISO 10646, on the other hand, has
realized that implementors will need time to get all this right, and may
choose to defer implementation of Unicode entirely if they are not able
to do it stepwise. ISO 10646 Level 1 is intended to be workable for a
large number of uses, while Level 3 is felt not to have an advantage qua
requirement until languages that require far more than composition and
decomposition are to be fully supported. I concur strongly with this.

The character-to-glyph mapping is fraught with problems. One possible
way to do this is actually to use the large private use areas to build
glyphs and then internally use only non-combining characters. The level
of dynamism in the character coding and character-to-glyph mapping here
is so much difficult to get right that the canonical-equivalent sequences
of characters (which is a fairly simple table-lookup process) pales in
comparison. That is, _if_ you allow combining characters, actually being
able to display them and reason about them (such as computing widths or
dealing with character properties of the implicit base character or
converting their case) is far more difficult than decomposing and
composing characters.

As for the scary effect of "variable length" -- if you do not like it,
canonicalize the input stream. This really is an isolatable non-problem.

Erik Naggum

unread,
Mar 24, 2002, 2:00:47 AM3/24/02
to
* Thomas Bushnell, BSG

| So a secondary question; if one is designing a new Common Lisp or Scheme
| system, and one is not encumbered by any requirements about being
| consistent with existing code, existing operating systems, or existing
| communications protocols and interchange formats: that is, if one gets to
| design the world over again:

If we could design the world over again, the _first_ thing I would want to
do is making "capital letter" a combining modifier instead of doubling
the size of the code space required to handle it. Not only would this be
such a strong signal to people not to use case-sensitive identifiers in
programming languages, we would have a far better time as programmers.
E.g., considering the enormous amount of information Braille can squeeze
into only 6 bits, with codes for many common words and codes to switch to
and from digits and to capital letters, the limitations of their code
space has effectively been very beneficial.

Ed L Cashin

unread,
Mar 24, 2002, 11:08:10 PM3/24/02
to
Erik Naggum <er...@naggum.net> writes:

...


> If we could design the world over again, the _first_ ting I would
> want to do is making "capital letter" a combining modifier instead
> of doubling the size of the code space required to handle it. Not
> only would this be such a strong signal to people not to use
> case-sensitive identifiers in programming languages, we would have
> a far better time as programmers.

Could you elaborate on that a bit? I'm interested because it appears
that you're position is that case-sensitivity in identifiers is a Bad
Thing for programming languages.

A general principle of mine is that if things are distinguishable,
they should not be collapsed but the distinction should be preserved
whenever possible. Treating different characters as the same
character, or treating different character sequences as equivalent,
should be postponed as long as possible in order to preserve
information.

Are you suggesting that this principle is inappropriate to apply to
the character sequences that compose identifiers in source code? That
would mean that "ABLE" is the same identifier as "able". I must admit
that when I first found out that current lisps have case-insensitive
symbol names, I thought it reminiscent of BASIC -- kind of a throwback
to a time when memory was much more at a premium. (I know that Lisp
predates BASIC. I'm talking about my reaction.) I'd be happy to hear
a good case for case-insensitive identifiers.

--
--Ed L Cashin | PGP public key:
eca...@uga.edu | http://noserose.net/e/pgp/

Kent M Pitman

unread,
Mar 24, 2002, 11:45:10 PM3/24/02
to
Ed L Cashin <eca...@uga.edu> writes:

> Erik Naggum <er...@naggum.net> writes:
>
> ...
> > If we could design the world over again, the _first_ ting I would
> > want to do is making "capital letter" a combining modifier instead
> > of doubling the size of the code space required to handle it. Not
> > only would this be such a strong signal to people not to use
> > case-sensitive identifiers in programming languages, we would have
> > a far better time as programmers.
>
> Could you elaborate on that a bit? I'm interested because it appears
> that you're position is that case-sensitivity in identifiers is a Bad
> Thing for programming languages.
>
> A general principle of mine is that if things are distinguishable,
> they should not be collapsed but the distinction should be preserved
> whenever possible. Treating different characters as the same
> character, or treating different character sequences as equivalent,
> should be postponed as long as possible in order to preserve
> information.

Psychology experiments have empirically shown that memory is auditory.
That is, when you misremember words, you misremember them by soundalike,
not by lookalike. There is also ample linguistic evidence that the core of
human language is an auditory phenomenon. When languages vary, they first
change in their spoken form and then later writing catches up, not much
vice versa. Since the spoken form has no notation for case differentiation,
the pretty obvious conclusion is that conceptual information is not best
carried in case. People don't remember whether they saw a word written in
uppercase or lowercase, they just remember the word. It is very rare and
quite awkward for someone to say "Use Capitalized-Foo" or
"Use All-Uppercase-FOO" to someone out loud in areas other than computer
science where people have worked themselves into corners by being pedantic
on a "general principle" as in your previous paragraph rather than observing
well-researched truths about how people really think.

Some of us believe that a proper harmonization/synchronization with the
way peoples' brains work is more important than catering to a theoretical
model that some people think would be a nice way for people to think.

I personally have made it a design goal in languages that I've worked on
to think hard about making even programming languages gracefully pronounceable
so that people can talk about programs aloud to each other over dinner, etc.
Modern Lisp has mostly moved away from obscure little names like "rplacd"
and such (a small number being retained mostly for history). For new
concepts, make names like MOST-POSITIVE-FIXNUM not MAXINT.

Even in cased languages, mostly people don't use case to distinguish, they
just use it for controlling the look of code. It's not uncommon for people
to have some things named Foo and others named BAR, but it's rarer for things
to be both named foo and Foo in a context where simple namespacing can't
tell the difference. So often again you don't hear people saying the case
out loud because it can be determined from other factors. At that point,
you might as well let people write stuff in whatever case they want, for
ease of input, and just let code pretty-printers adjust the case to a pretty
look if it's really needed.

IMO, no ordinary code should ever be case-sensitive and it's a darned shame
that XML uses case-sensitive identifiers. I think it does mainly so it
can service languages that have made a bad design decision ... so it's a
dependent bad decision, not an independent one.

> Are you suggesting that this principle is inappropriate to apply to
> the character sequences that compose identifiers in source code? That
> would mean that "ABLE" is the same identifier as "able".

Yes.

> I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium. (I know that Lisp
> predates BASIC. I'm talking about my reaction.) I'd be happy to hear
> a good case for case-insensitive identifiers.

Cased names are often a substitute in infix languages for having given up
hyphen in a way that got messy. You can't call a variable MOST-POSITIVE-FIXNUM
in most languages, because it thinks you mean MOST - POSITIVE - FIXNUM, a
subtraction. Dylan requires you to put spaces around minus so it can
have both minus and subtraction. Doing MostPositiveFixnum is not very
natural and also forces case to be used in a way that supports separation,
taking away the ability to use case for what it was intended for: supporting
the underlying language. So if I have a word like eBusiness in "English"
and I want to compose it into a function, do I make it be MakeeBusinessName
or MakeEbusinessName or .... personally, I prefer make-eBusiness-name.

It might even be better to use _'s, but it's a shifted character on most
keyboards, and people with weak fingers hate shifting that often, so hyphens
tend to be preferred. make_eBusiness_name might otherwise be better, and
would save confusion with minus sign.

[CL uses uppercase as the canonical case for the case-normalized name,
and that's controversial with some people, but some of us like it. In any
case, it's orthogonal to this other question about case translation.]

In any case, my real point is not to say there's a 100% clear answer here,
but merely to motivate that the choice of case-translation is not archaic
but definitely has support from people who think themselves to be living
in the present.

Erik Naggum

unread,
Mar 25, 2002, 12:06:41 AM3/25/02
to
* Ed L Cashin <eca...@uga.edu>

| Could you elaborate on that a bit? I'm interested because it appears
| that you're position is that case-sensitivity in identifiers is a Bad
| Thing for programming languages.

I consider it a bad thing to believe that A is a different character from
a just because it has a certain "presentation property". I mean, we do
not distinguish characters based on font or face, underlining or color,
and most people realize that these are incidental properties. However,
capitalness of a letter is just as incidental: The fact that a letter is
capitalized depending on such randomness as the position of the word in
the sentence is a very strong indicator that "However" and "however" are
not different words, which is effectively what case-sensitive people
think they are. I tried to publish text without this incidental property
for a while, but it seemed to tick people off even more than calling an
idiot an idiot.

| A general principle of mine is that if things are distinguishable, they
| should not be collapsed but the distinction should be preserved whenever
| possible. Treating different characters as the same character, or
| treating different character sequences as equivalent, should be postponed
| as long as possible in order to preserve information.

If you use colors to distinguish keywords from identifiers in your editor,
can you use a keyword with a different color as an identifier?

| Are you suggesting that this principle is inappropriate to apply to the
| character sequences that compose identifiers in source code? That would
| mean that "ABLE" is the same identifier as "able".

| I must admit that when I first found out that current lisps have
| case-insensitive symbol names, I thought it reminiscent of BASIC -- kind
| of a throwback to a time when memory was much more at a premium.

But this is not the case. The symbol names are case-sensitive, but the
Common Lisp reader maps all unescaped characters to uppercase by default.
You can change this. Symbols are in this fashion just like normal words
in your natural language.
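
A quick illustration of the default behaviour just described (standard
Common Lisp):

  (read-from-string "car")     ; => CAR    (unescaped letters folded up)
  (read-from-string "|car|")   ; => |car|  (escaped, case preserved)
  (eq (read-from-string "car") (read-from-string "CAR"))  ; => T
  (readtable-case *readtable*) ; => :UPCASE (the default; it can be changed)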

| (I know that Lisp predates BASIC. I'm talking about my reaction.) I'd
| be happy to hear a good case for case-insensitive identifiers.

I think case sensitivity is an abuse of an incidental property. Thus, I
want to hear a good case for case-sensitive identifers. Older languages
did not have this property, but after Unix (which has a case-insensitive
tty mode!), the norm became to distinguish case, largely because there
was no other namespace functionality in early C. Unix also chose to use
lower-case commands whereas Multics had always supported case-folding. I
believe the reason that the Unix people wanted to distinguish case was
that it would require an extra instruction and a lookup table that would
waste a precious 128 bytes of memory in the kernel, while we currently
waste an enormous amount of memory to keep case-folding tables several
times over. In my view, case-sensitive identifiers has become the norm
in a community that has failed to think about proper solutions to their
problems, but rather choose to solve only the immediate problem, much
like C strongly encourages irrelevant micro-optimization. So instead of
being nice to the user, they were nice to the programmer, who did not
have to case-fold the incoming identifiers. I consider moving this
burden onto the user to be quite user-inimical and actually quite foreign
to people who do not know the character coding standards. I mean, do we
have case-sensitive trademarks, even though we traditionally capitalize
proper names? Are Oracle and ORACLE different companies any more than
ORACLE in red boldface 14 point Times Roman is a different company than
ORACLE in blue italic 12 point Helvetica?

There has definitely been "paradigm shift" in computer people's view on
case, but not in non-computer people. Internet protocols like SMTP use
case-insensitive commands. The DNS is case-insensitive. SGML is
case-insensitive and so is HTML. Because of the huge problems we face
with case-folding Unicode (which must be done with a table of some kind),
some people have figured that we should _not_ do case-folding. That is
the wrong solution to the problem. The right solution to the problem is
to get rid of case as a character property.

Now, assume that we no longer have different character codes for lower-
case and upper-case letters. Would there be any difference in how we
look at text on computer screens, in print, etc? No, of course not.
Therefore, people would still be able to distinguish identifiers visually
based on case if they want to -- just like the Common Lisp reader allows
you to write |car| to refer to the symbol named "car", and |CAR| to refer
to the symbol named "CAR", and just like Unix can deal with upper- and
lower-case letters even when iuclc and olcuc is in effect with the xcase
option by backslashing the real uppercase characters in your input. (In
Common Lisp, you would backslash a lower-case character in the default
reader mode, and the printer will escape those characters that should not
be case-folded.) However, being able to do something and actually doing
it are two very different things. E.g., on TOPS-20, you could use
lower-case letters in filenames if you really wanted to, by prefixing
them with ^V. Very few people bothered to do this because typing it in
was a hassle. I do not propose any change to how we input upper and
lower case, but with the anal-retentive approach to saving bits, which
has even gone so far as to write FooBarZot instead of foo-bar-zot, the
probability that the C freaks would have chosen case-sensitivity would be
remarkably lower -- if we could go back and design the world over...

Christopher Browne

unread,
Mar 25, 2002, 12:28:15 AM3/25/02
to
Centuries ago, Nostradamus foresaw when Kent M Pitman <pit...@world.std.com> would write:
> Psychology experiments have empirically shown that memory is
> auditory. That is, when you misremember words, you misremember them
> by soundalike, not by lookalike. There is also ample linguistic
> evidence that the core of human language is an auditory phenomenon.
> When languages vary, they first change in their spoken form and then
> later writing catches up, not much vice versa.

I agree in part.

The "western" languages certainly are representative of that; our
languages are largely a way of taking what we say and putting it on
paper. (Computers being an insignificant "blip" thus far in the
history of it :-).)

My understanding of the Asian languages is that they are often _not_
such a representation; what is written is _not_ an account what is
spoken. Writing is, there, representative of a separate language. In
more clearly "pictographic" languages, there may _not_ be an auditory
form except as constructed afterwards.

That caveat being given, words don't usually sound different when they
have different casing and aren't usually recognized as being
different.

"That" is not a different word from "that."
--
(reverse (concatenate 'string "ac.notelrac.teneerf@" "454aa"))
http://www.ntlug.org/~cbbrowne/linux.html
"Of _course_ it's the murder weapon. Who would frame someone with a
fake?"

Duane Rettig

unread,
Mar 25, 2002, 5:00:01 AM3/25/02
to
Ed L Cashin <eca...@uga.edu> writes:

> A general principle of mine is that if things are distinguishable,
> they should not be collapsed but the distinction should be preserved
> whenever possible. Treating different characters as the same
> character, or treating different character sequences as equivalent,
> should be postponed as long as possible in order to preserve
> information.

This is your opinion, and many people agree with you, but many do not,
as well. This is a very controversial subject. And it's not just in
comp.lang.lisp that you'll find this same controversy; at about the same
time as our last discussion here there was a similar one raging on
comp.arch. The difference was that here the case-insensitive style being
advocated was (of course) the case-folding style that the Common Lisp
reader standardizes, and in comp.arch the predominant case-insensitive
style being argued was the "case-preserving" style, which is the kind
of recognition style that both Mac and Windows filesystems support
(i.e. first reference gets internalized as originally specified, but
subsequent references are matched against the filename without regard
to case). This case-preserving insensitive style was being pitted
against the Unix case-sensitive style. Of course, neither side
changed the other's mind.

Arguing case-sensitivity is very similar to arguing endianness; there
are good arguments for both big-endian and little-endian, and neither
side is fully right or fully wrong, though a decision must usually be
made, because it is generally hard to mix the two together in the same
machine.

> Are you suggesting that this principle is inappropriate to apply to
> the character sequences that compose identifiers in source code? That
> would mean that "ABLE" is the same identifier as "able". I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium. (I know that Lisp
> predates BASIC. I'm talking about my reaction.) I'd be happy to hear
> a good case for case-insensitive identifiers.

First, I'll note (as others have) that Common Lisp does have
case-sensitive identifiers, and always has. It is the reader that
is specified to fold to uppercase by default. And even the
standard CL reader is highly configurable, to allow cases to be
specified by readtable options.
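
The standard knob being referred to is readtable-case, e.g.:

  (setf (readtable-case *readtable*) :preserve) ; or :upcase, :downcase, :invert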

Second, the choice of case-sensitivity or not is not bounded by
time. Going back to the endianness question, some engineers 10
years ago said "the little-endian side has lost". However, I
suspect that if you count all of the little-endian machines in
existence today, you find it hard to justify that claim. In
fact, even many computers which are generally considered to be
big-endian are now architected to allow for either endianness.

Finally, I personally believe in choice. Our own product has
always allowed one to choose whether to decide on the Common Lisp
specified case-insensitive reader, or whether to configure the reader
to be case-sensitive by default. Our customer base has always taken
advantage of that choice, with anywhere from approximately 20% to 35%
choosing the case-sensitive mode, and the majority choosing the Common
Lisp (case-insensitive, folding to uppercase) mode. And of course,
this does not account for people who use lisps of both modes for
different purposes. Nowadays, there is a slight increase in
case-sensitive mode for the purpose of interfacing relatively directly
with some currently popular case-sensitive languages. The point,
though, is that we have always provided a choice, and always intend
to provide a choice.

In fact, Kent Pitman recently sent us a proposal for unifying
the two major case-modes that Allegro CL provides, in such a
way that the two can exist in the same lisp simultaneously.
We have an rfe (request for enhancement document) which starts
with his proposal as a basis. I would love to see us succeed
in making this or any similar unification, and I was excited to
see Kent's proposal when he sent it to us.

It's all about choice. Calling the case-insensitive choice a
"throwback" is the same as calling it invalid (or no longer
valid). And based on my own experience here and in comp.arch,
that is simply incorrect. People still choose both styles,
and probably always will.

Matthias Blume

unread,
Mar 25, 2002, 8:17:16 AM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> * Ed L Cashin <eca...@uga.edu>
> | Could you elaborate on that a bit? I'm interested because it appears
> | that you're position is that case-sensitivity in identifiers is a Bad
> | Thing for programming languages.
>
> I consider it a bad thing to believe that A is a different character from
> a just because it has a certain "presentation property". I mean, we do
> not distinguish characters based on font or face, underlining or color,
> and most people realize that these are incidental properties. However,
> capitalness of a letter is just as incidental: The fact that a letter is
> capitalized depending on such randomness as the position of the word in
> the sentence is a very strong indicator that "However" and "however" are
> not different words, which is effectively what case-sensitive people
> think they are.

This is not strictly true in all (natural) languages.

Example 1: German:
- no 1-1 correspondence between upper-case and lower-case (there is one
letter that only exists in the lower-case set)
- some words change class, meaning, and pronunciation when going from
one case to the other (example: Weg vs. weg)
- case is used (or at least has been -- until it became non-pc in some
circles) to put semantic fine points into print (e.g., capitalization of
the second person in letters for politeness)

Example 2: Japanese
- there is no distinction between upper-case and lower-case at all
- HOWEVER: there are still two distinct sets of the phonetic characters
called "hiragana" and "katakana". Either one could spell the entire
language, but usage of the two sets again depends on things like
origin of the word in question, emphasis, style, etc.
One could think of katakana as the upper-case version of hiragana.
Usage is often analogous, for example one would sometimes find
hiragana words spelled in katakana for EMPHASIS.
- Written Japanese also uses kanji (Chinese characters), all of which could
be spelled either in hiragana or katakana. Unfortunately, the mapping
between kanji and hiragana is many-to-many, which shows that the "is the
same word" relationship is not an equivalence relation because it is
not transitive: "hashi" (chopsticks) and "hashi" (bridge) are spelled
exactly the same in hiragana (but are pronounced slightly differently),
but the kanji for the respective words are not the same. OTOH, "kyou"
and "konnichi" are clearly not the same words when spelled phonetically,
but both correspond to the same kanji combination. There are literally
thousands of examples for this in Japanese (which does not make it particularly
easy to learn :-).

Example 3: English
- Speaking of "him" and speaking of "Him" are clearly semantically very different.

Example 4: Mathematics (well, this one is not "natural", after all...)
- In the "language of mathematics" we frequently make semantic distinctions
between typographically different versions of the "same" character.

Anyway, all I wanted to say was that the distinction between different
versions of a character set is not completely incidental in many
(most?) natural languages. I do not want to use this as an argument
for or against case-sensitive identifiers in programming languages,
since I do not think that programming languages should in any form or
manner be modelled after natural ones. (However, I must admit that I
personally prefer being able to use mixed case when programming.)

Matthias

Erik Naggum

unread,
Mar 25, 2002, 9:14:10 AM3/25/02
to
* Matthias Blume <matt...@shimizu-blume.com>

| This is not strictly true in all (natural) languages.

All of these arguments indicate that using the capital letter for the
sentence-initial word is a very bad design choice for a written language;
it violates that strong sense of difference that those who want it to
exist focus so strongly on. However, I would argue that the sheer
acceptability of destroying the importance of the capital letter in the
sentence-initial word cannot be ignored. When I tried to _preserve_ the
case of the word despite its position in the sentence, this was regarded
as Very Wrong by a bunch of hostile lunatics. This indicated to me that
case is _primarily_ incidental, since the intrinsic role can at any time
be overridden by the incidental role -- specifically, you have no idea
whatsoever what the capitalization of the sentence-initial word would be
if it were moved, yet this causes absolutely no problem for anyone.

| Anyway, all I wanted to say was that the distinction between different
| versions of a character set is not completely incidental in many (most?)
| natural languages.

In real life, nothing is ever completely anything. People use and abuse
case "because it's there". This would not change if capital letters were
coded with a "flag" that communicated capitalness. On the contrary, if
we had such a flag, the natural development is to have _two_ flags: One
for the incidental capital and one for the intrinsic capital. In either
case, the display and the coding properties of a character should be
separated. You provided an excellent example of this with hiragana and
katakana.
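
To make the incidental/intrinsic idea concrete, here is a minimal and
purely illustrative Common Lisp sketch (every name in it is invented here,
and it is not a proposal for any actual character set): the letter's
identity is one coded value, and its capitalization is a separate
attribute that only display code consults.

(defstruct coded-letter
  (base #\a)            ; the case-less letter identity
  (capital nil))        ; nil, :incidental, or :intrinsic

(defun same-letter-p (a b)
  ;; identity comparison ignores the presentation attribute entirely
  (char-equal (coded-letter-base a) (coded-letter-base b)))

(defun render-letter (c)
  ;; presentation consults the attribute only at display time
  (if (coded-letter-capital c)
      (char-upcase (coded-letter-base c))
      (coded-letter-base c)))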

| I do not want to use this as an argument for or against case-sensitive
| identifiers in programming languages, since I do not think that
| programming languages should in any form or manner be modelled after
| natural ones.

That is not the argument. Please try to understand this. The point is
that I have taken the liberty to design the world over again, backing up
to _before_ computer geeks coded their character sets, and making a
crucial change to the coding of upper-case vs lower-case characters. The
names "upper-case" and "lower-case" refer to typographic characteristics,
not meaning. Meaning may be coded separately from typography, just as we
do in almost every other case.

| (However, I must admit that I personally prefer being able to use mixed
| case when programming.)

If it had been more costly for you to achieve this, in terms of "knowing"
that you would waste additional space to encode capital letters, would
you still have preferred it? I believe, from the reactions to the
extended experiment with not randomly upcasing the sentence-initial word,
that people would be inclined to accept a coding overhead for that role,
as well as for proper nouns, but randomly and liberally sprinkling such
overhead throughout identifiers in order to achieve an unnatural visual
effect only because it could be done, would most likely not happen. As
Common Lisp uses the hyphen to separate words, which would have no higher
overhead than embedded capital letters, other languages would have far
less inclination to make this horrible mistake, and would therefore not
_require_ case-sensitivity.

Whether the programmers would prefer a case-folding or a case-preserving
case-insensitivity is an open question, but at least designing languages
and coding conventions to use case would not likely happen if case was
regarded as just as incidental as color or typeface.

Matthias Blume

unread,
Mar 25, 2002, 10:40:50 AM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> [ ... ] The point is
> that I have taken the liberty to design the world over again [...]

Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)

Erik Naggum

unread,
Mar 25, 2002, 11:11:59 AM3/25/02
to
* Matthias Blume <matt...@shimizu-blume.com>

| Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)

Yeah, me too. Then I could force you to pay attention to the premises
that start a discussion instead of completely ignoring the context.
Please see <32259420...@naggum.net>, and pay particular attention to
what Thomas Bushnell wrote.

Sheesh, some people.

Matthias Blume

unread,
Mar 25, 2002, 12:35:46 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> * Matthias Blume <matt...@shimizu-blume.com>
> | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
>
> Yeah, me too.

I was under the impression that you thought you already did. :-)

> Then I could force you to pay attention to the premises
> that start a discussion instead of completely ignoring the context.
> Please see <32259420...@naggum.net>, and pay particular attention to
> what Thomas Bushnell wrote.

To be frank, I do not care *one bit* about what this discussion was
originally about. I was merely commenting on your claim about
capitalization being "incidental". The debate of whether or not
case-sensitive identifiers in programming languages are Good or Evil,
or which character set design uses up more bits than others, etc., bores
me.

Matthias

Kent M Pitman

unread,
Mar 25, 2002, 12:59:59 PM3/25/02
to
Matthias Blume <matt...@shimizu-blume.com> writes:

> Erik Naggum <er...@naggum.net> writes:
>
> > * Matthias Blume <matt...@shimizu-blume.com>


> > | Oh, how I'd *love* to live in a world where Erik Naggum is God... :-)
> >

> > Yeah, me too.
>
> I was under the impression that you thought you already did. :-)
>
> > Then I could force you to pay attention to the premises
> > that start a discussion instead of completely ignoring the context.
> > Please see <32259420...@naggum.net>, and pay particular attention to
> > what Thomas Bushnell wrote.
>
> To be frank, I do not care *one bit* about what this discussion was
> originally about. I was merely commenting on your claim about
> capitalization being "incidental". The debate of whether or not
> case-sensitive identifiers in programming languages are Good or Evil,
> or which character set design uses up more bits than others, etc., bores
> me.

Capitalization _is_ incidental. It is ceremonially marked in written
text, but my impression based on a basic knowledge of linguistics and
a casual outside view of German [I don't purport to speak the
language] is that German people may claim that "weg" and "Weg" are
different words, but the capitalization is not pronounced audibly, so
there is generally enough contextual information to disambiguate in
speech. Certainly this is the case for English situations like "God
loves you." and "The god loves you." These are different words, God.
One is a proper name and one isn't. But it could be miscapitalized as
"god loves you" or "The God loves you". It is possible for there to
be ambiguity in spite of this in some cases, but it's also possible to
have ambiguity in the case of correct case, too. Human language is
not precise. But normally where a confusion is common, some audible
notation arises to disambiguate. And, incidentally, the audible
notation is [to my knowledge] never the addition of the word
"uppercase" or "lowercase" because that just isn't the issue in play.
It's usually the addition of a guide word, a case marking, a
determiner, etc.

Matthias Blume

unread,
Mar 25, 2002, 1:43:11 PM3/25/02
to
Kent M Pitman <pit...@world.std.com> writes:

> [ ... ] outside view of German [I don't purport to speak the


> language] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly,

The two words are pronounced very differently.

> so there is generally enough contextual information to disambiguate in
> speech.

Ok, so everything that can be inferred from context is "incidental"
then? Most spelling mistakes can be inferred from context, so should
we make programming languages tolerate them? (It has been tried, as
you know.)

Anyway, this whole debate is supremely silly, IMHO. Fortunately
neither you nor Erik get to dictate the rules, at least not for those
languages that I speak or program in...

Matthias

Thomas Bushnell, BSG

unread,
Mar 25, 2002, 1:56:42 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> Yeah, me too. Then I could force you to pay attention to the premises
> that start a discussion instead of completely ignoring the context.
> Please see <32259420...@naggum.net>, and pay particular attention to
> what Thomas Bushnell wrote.

So, getting back to my original question about charset implementations
in Lisp/Scheme (though actually Smalltalk or any such
dynamically-typed language will have the same questions and probably
the same kinds of solutions), I've done some more study and thinking,
so let me try again. My previous question was a tad innocent, it
appears, because I was unaware of the great changes that have taken
place in Unicode since the last time I read through it and grokked the
whole thing (which was back at version 1.2 or something).

I haven't fully internalized the terminology yet, though I'm trying.
So please bear with any minor terminological gaffes (and correct them,
too).

The GNU/Linux world is rapidly converging on using UTF-8 to hold
31-bit Unicode values. Part of the reason it does this is so that
existing byte streams of Latin-1 characters can (pretty much) be used
without modification, and it allows "soft conversion" of existing
code, which is quite easy and thus helps everybody switch.

But I'm thinking about a "design the world over again" kind of
strategy. Now Erik is certainly right that capitalization *should* be
a combining character kind of thing. So let me stipulate that I want
to take Unicode as-is; I get to design *my computer system*, subject
to the a priori constraint that Unicode has done a *lot* of work, so I
will accept slight deficiencies if they help Unicode work right on the
system. So I'll take the existing Unicode encodings, even if they
don't do capitals just like we'd want.

But I don't get to redesign existing communications protocols and
such; however, that's an externalization issue, and for internal use
on the system, such protocols don't matter. Similar comments apply
for existing filesystems formats, file conventions, and the like.

Now, I *could* just use UTF-8 internally, but that seems rather
foolish. I think it's obvious that characters should be "immediately"
represented in pointer values in the way that fixnums are.

Now the Universal Character Set is officially 31 bits, but only 16
bits are in use now, and it is expected that at most 21 bits will be
used. So that means it's pretty easy to make sure the whole space of
UCS values fits in an immediate representation. That's fine for
working with actively used data.

However, strings that are going to be kept around a long time should,
it seems to me, be stored more compactly. Essentially all strings
will be in the Basic Multilingual Plane, so they can fit in 16 bits.
That means there would be two underlying string datatypes. I don't
think this is a serious problem. Is it worth having a third (for
8-bit characters) so that Latin-1 files don't have to be inflated by a
factor of two? It seems to me that this would be important too.
Basically then we would have strings which are UCS-4, UCS-2 and
Latin-1 restricted (internally, not visibly to users).

So even if strings are "compressed" this way, they are not UTF-8.
That's Right Out. They are just direct UCS values. Procedures like
string-set! therefore might have to inflate (and thus copy) the entire
string if a value outside the range is stored. But that's ok with me;
I don't think it's a serious lose.

So is this sane?
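
As a very rough Common Lisp sketch of the idea (the keyword names and
helper functions here are invented for illustration only; a real
implementation would do this below the language level, on the actual
storage):

(defparameter *string-kinds* '(:latin-1 :ucs-2 :ucs-4))  ; narrow to wide

(defun kind-for-code (code)
  ;; which internal representation a single UCS code point needs
  (cond ((< code #x100)   :latin-1)   ; 8-bit subset, Latin-1 compatible
        ((< code #x10000) :ucs-2)     ; Basic Multilingual Plane
        (t                :ucs-4)))   ; anything beyond 16 bits

(defun widen-if-needed (current-kind code)
  ;; a string-set!-style store keeps the current representation when the
  ;; new code fits, otherwise answers the wider kind; the caller would
  ;; then copy ("inflate") the whole string, as described above
  (let ((needed (kind-for-code code)))
    (if (> (position needed *string-kinds*)
           (position current-kind *string-kinds*))
        needed
        current-kind)))

;; (widen-if-needed :latin-1 #x3B1) => :UCS-2   ; storing a Greek alpha
;; (widen-if-needed :ucs-2   #x41)  => :UCS-2   ; plain #\A already fits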

Ok, then the second question is about combining characters. Level 1
support is really not appropriate here. It would be nice to support
Level 3. But perhaps Level 2 with Hangul Jamo characters [are those
required for Level 2?] would be good enough.

It seems to me that it's most appropriate to use Normalization Form
D. Or is that crazy? It has the advantage of holding all the Level 3
values in a consistent way. (Since precombined characters do not
exist for all possibilities, Normalization Form C results in some
characters precombined and some not, right?)
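
To make that concrete, here is what the two forms look like as raw code
points (illustrative only; the q example is one for which, as far as I
know, no precomposed code point exists):

;; LATIN SMALL LETTER E WITH ACUTE in the two normalization forms:
(defparameter *e-acute-nfc* #(#x00E9))          ; precomposed é
(defparameter *e-acute-nfd* #(#x0065 #x0301))   ; e + COMBINING ACUTE ACCENT

;; A sequence like q + COMBINING ACUTE ACCENT has (to my knowledge) no
;; precomposed code point, so Form C leaves it decomposed while the é
;; above stays precomposed -- the mix referred to above.  Form D
;; decomposes everything uniformly:
(defparameter *q-acute* #(#x0071 #x0301))       ; same in Form C and Form D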

And finally, should the Lisp/Scheme "character" data type refer to a
single UCS code point, or should it refer to a base character together
with all the combining characters that are attached to it?

Thomas

Kent M Pitman

unread,
Mar 25, 2002, 2:30:34 PM3/25/02
to
Matthias Blume <matt...@shimizu-blume.com> writes:

> Kent M Pitman <pit...@world.std.com> writes:
>
> > [ ... ] outside view of German [I don't purport to speak the
> > language] is that German people may claim that "weg" and "Weg" are
> > different words, but the capitalization is not pronounced audibly,
>
> The two words are pronounced very differently.
>
> > so there is generally enough contextual information to disambiguate in
> > speech.
>
> Ok, so everything that can be inferred from context is "incidental"
> then? Most spelling mistakes can be inferred from context, so should
> we make programming languages tolerate them? (It has been tried, as
> you know.)

Please read Aristotle on Virtue Ethics. The mean between unreasonable
extremes is not something with a fixed answer. The fact that its precise
point in design space is not uniquely determined does not mean it should
not be something people strive for. If anyone seriously wants to defend
spelling errors as a good design theory, we could have a discussion about
it. Otherwise, it's a pointless red herring. I do, however, contend that
there is a theory behind the point of view CL has, and was merely describing that
point of view.



> Anyway, this whole debate is supremely silly, IMHO. Fortunately
> neither you nor Erik get to dictate the rules, at least not for those
> languages that I speak or program in...

We aren't dictating rules, and I personally don't really appreciate this
attempt to recast my defense of an arbitrary but reasonable design choice
into some sort of ignorant attempt to control the world.

All we have done is to try to explain the present state of affairs based
on an attempt for harmony with something people do with a great deal of
statistical regularity. Probably there is no deed that everyone does with
any predictability other than, as they say, death and taxes, but it seems
inappropriate to base design on the idea that this implies no other
large scale regularities worth checking into...

Thomas Bushnell, BSG

unread,
Mar 25, 2002, 2:44:55 PM3/25/02
to
Kent M Pitman <pit...@world.std.com> writes:

> Please read Aristotle on Virtue Ethics. The mean between unreasonable
> extremes is not something with a fixed answer.

It can also only be determined by the man with a particular virtue
known as "practical wisdom", as well. And, with practical wisdom,
comes all the virtues, not just one or two. Which means that only the
person with true virtue is even able to tell what the Right Thing to
do is.

Aristotle's talk of a "mean" is a metaphor, of course. It's some kind
of balance, some kind of "just enough" notion.

Some medievals liked to pooh-pooh this by taking it over-literally, with
a rather snide attack. Thomas Aquinas, however, liked the "mean"
theory, and here's how he treats of the snide attackers (from the
"Quastio disputata de virtutibus in communi", Article 13, Objection 7
and the response):

Whether virtue lies in a mean. It seems not....Boethius in "On
arithmetic" speaks of a threefold mean, the arithmetical, as 6
between 4 and 8 which is an equal distance from both, and the
geometrical, as 6 between 9 and 4, which is proportionally the same
distance from both, and the harmonic or musical mean, as 3 between 6
and 2 because there is the same proportion of one extreme to the
other, namely, 3 (which is the difference between 6 and 3) to 1 which
is the difference between 2 and 3. But none of these means is found
in virtue, since the mean of virtue does not relate equally to
extremes, nor in a quantitative way nor according to some proportion
of the extremes and differences. Therefore, virtue does not lie in
the mean.

[replies Thomas]: It should be said that the means spoken of by
Boethius lie in things and thus are not relevant to the mean of
virtue which is determined by reason. Justice seems to be an
exception since it involves both a mean in things and another
according to reason: The arithmetical mean is relevant to exchange
and the geometrical to distribution, as is clear from [Aristotle's
Nicomachean] Ethics [book] 5.

Anyway, I'd recommend the Nicomachean Ethics of Aristotle to anyone
interested in thinking. You'll find it aggravating; he's quite
unmodern and actually quite bogus in a lot of ways, but he is truly
important and it will change a great deal about how you think, if you
take it seriously.

Thomas

Michael Parker

unread,
Mar 25, 2002, 3:13:25 PM3/25/02
to
Erik Naggum <er...@naggum.net> wrote in message news:<32260544...@naggum.net>...
> ... but at least designing languages

> and coding conventions to use case would not likely happen if case was
> regarded as just as incidental as color or typeface.

OTOH, if terminals had gotten color and typefaces earlier, maybe
programming languages would have evolved to use them. Maybe give
each namespace its own color, so you would specify the value of a
name by putting it in blue, the function by using red, keywords in
italics, macros in green. The mind boggles at the possibilities.
In fact, if you want to boggle your mind, see

http://www.sleepless-night.com/cgi-bin/twiki/view/Main/ColorForth

Which describes Chuck Moore's latest dialect of Forth that does
this sort of thing.

Matthias Blume

unread,
Mar 25, 2002, 3:08:13 PM3/25/02
to
Kent M Pitman <pit...@world.std.com> writes:

> We aren't dictating rules, and I personally don't really appreciate this
> attempt to recast my defense of an arbitrary but reasonable design choice
> into some sort of attempt at an ignorant attempt to control the world.

Sorry, I was unreasonably harsh on you, Kent.

> All we have done is to try to explain the present state of affairs based
> on an attempt for harmony with something people do with a great deal of
> statistical regularity.

As I have tried to point out, this sort of regularity isn't actually
quite as regular as some try to make it. The Japanese language is a
great example (although there the distinction is not called "uppercase vs.
lowercase").

By the way, here is an example in a case-sensitive natural language
where the distinction between uppercase and lowercase gets
*pronounced*: "mit" vs. "MIT" in German. The first means "with" and is
pronounced like "mitt", the second is the Massachussetts Institute of
Technology and is pronounced like speakers of English would pronounce
it: em-ay-tee. I think that there are enough examples of this around
so that making a distinction between uppercase and lowercase is
warranted in the natural language case. Again, I do not think that
this needs to be in any way correlated with the PL case.

Matthias

Andreas Eder

unread,
Mar 25, 2002, 4:02:01 PM3/25/02
to
Kent M Pitman <pit...@world.std.com> writes:

> Capitalization _is_ incidental. It is ceremonially marked in written
> text, but my impression based on a basic knowledge of linguistics and
> a casual outside view of German [I don't purport to speak the
> language] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly, so
> there is generally enough contextual information to disambiguate in
> speech.

Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a
long 'e' and the other with a short one - that is because they are
different words. Should you incidentally start a sentence with 'weg',
thus writing it with a capital 'W', it would still be pronounced like
'weg'. This might be difficult to understand, but that is how natural
languages are, I guess.

Andreas
--
Wherever I lay my .emacs, there's my $HOME.

Dorai Sitaram

unread,
Mar 25, 2002, 5:13:50 PM3/25/02
to
In article <m3zo0wu...@elgin.eder.de>,


To me, the claim that case is indeed ornamental is supported by
the fact that it appears to be permissible to
upper-case a German sentence in its entirety
without construing it as a loss of information.

BITTE EIN BIT
ICH BIN EIN BERLINER
DIE MAUER MUSS WEG!

usw.

I.e., things like titles, slogans, and billboards, but
also consider the GPL or other license text in
German, where large globs of the prose are in all caps.
Legal prose, it seems to me, would especially not court
information loss in this manner if it was felt there
really was a risk.

I'm curious: Is there an example, however
frivolous, where WEG in an all-caps sentence
could be ambiguous?

BTW, the {Weg, weg} pair seems very like the {produce
(noun), produce (verb)} pair in English. Like Weg/weg,
produce/produce are pronounced differently.
However, they don't rely on capitalization, even
though the grammatical context used to disambiguate
between them has fewer cues than the German.

--d

Pierre R. Mai

unread,
Mar 25, 2002, 4:44:54 PM3/25/02
to
Matthias Blume <matt...@shimizu-blume.com> writes:

> By the way, here is an example in a case-sensitive natural language
> where the distinction between uppercase and lowercase gets
> *pronounced*: "mit" vs. "MIT" in German. The first means "with" and is
> pronounced like "mitt", the second is the Massachussetts Institute of
> Technology and is pronounced like speakers of English would pronounce
> it: em-ay-tee. I think that there are enough examples of this around

This is "supremely silly", if there is such a thing, even ignoring for
the time that MIT is neither a german word, nor a german abbreviation,
and that probably a large number of german speakers will not recognize
MIT as standing for "the" MIT, nor pronounce it as speakers of English
would. The different pronounciation of mit vs. MIT doesn't result
from the difference in case, at all. If you receive a telex that
informs you of an invitation to "the mit", you will pronounce "mit"
just as you would "MIT". qed.

Of course that doesn't mean that case should be completely ignored, it
just means that case is just another attribute of text, like fonts,
and that there is little reason to encode it in the character.

It also means that you want to distinguish between mit (with) and MIT
(the institute) not based on case, but based on packages, i.e.

(and (not (eq 'german-words:mit 'universities:mit))
     ;; And now an example where case will not help in disambiguation
     ;; namely the sequence "tub", standing for both the english word
     ;; tub and the common abbreviation for the Technische Universität
     ;; Berlin
     (not (eq 'english-words:tub 'universities:tub)))

Regs, Pierre.

--
Pierre R. Mai <pm...@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein

Matthias Blume

unread,
Mar 25, 2002, 6:00:57 PM3/25/02
to
ds...@goldshoe.gte.com (Dorai Sitaram) writes:

> I'm curious: Is there an example, however
> frivolous, where WEG in an all-caps sentence
> could be ambiguous?

Yes, there is a joke about a stupid person who tries to figure out
which street he is in and comes up with

"We are on the trail with the nukes."

because he misread the slogan

"WEG MIT DEN ATOMWAFFEN" (meaning "GET RID OF THE NUKES")

as a streetsign.

> BTW, the {Weg, weg} pair seems very like the {produce
> (noun), produce (verb)} pair in English. Like Weg/weg,
> produce/produce are pronounced differently.

In this case, there is at best a very remote semantic relationship (if
any). It is definitely nowhere near a noun/verb sort of thing.

Matthias

Kent M Pitman

unread,
Mar 25, 2002, 6:46:19 PM3/25/02
to
Matthias Blume <matt...@shimizu-blume.com> writes:

> ds...@goldshoe.gte.com (Dorai Sitaram) writes:
>
> > I'm curious: Is there an example, however
> > frivolous, where WEG in an all-caps sentence
> > could be ambiguous?
>
> Yes, there is a joke about a stupid person who tries to figure out
> which street he is in and comes up with
>
> "We are on the trail with the nukes."
>
> because he misread the slogan
>
> "WEG MIT DEN ATOMWAFFEN" (meaning "GET RID OF THE NUKES")
>
> as a streetsign.

Yes, but this kind of confusion can happen whether case is involved or not,
and I think it's not fair to ascribe it to case as the principal cause.
We have signs on our highways that say "FINE FOR LITTERING". Writing
them in lowercase won't help. ;-)

> > BTW, the {Weg, weg} pair seems very like the {produce
> > (noun), produce (verb)} pair in English. Like Weg/weg,
> > produce/produce are pronounced differently.
>
> In this case, there is at best a very remote semantic relationship (if
> any). It is definitely nowhere near a noun/verb sort of thing.

There is a phenomenon in English speech wherein stress matters, too,
and we sometimes italicize not just to control emphasis but to
actively disambiguate. A prime example of this is an effect called
anaphoric de-stressing (that is, lessening stress in order to turn a
reference into an anaphoric reference--that is, a reference to a previously
introduced entity--instead of a non-anaphoric reference--that is, a reference to a
newly introduced entity). The example I've seen is a story of a newsreader
misreading an account of how a man, upon hearing his wife had had an affair
with another man, had said he wanted to shoot the bastard. (Note how the
sentence changes meaning, depending on whether you put stress on _shoot_
or on _bastard_.) Written English doesn't mark this distinction in writing,
even though it's present and by some stretch important in spoken English.
People figure it out.

Erik Naggum

unread,
Mar 25, 2002, 7:35:09 PM3/25/02
to
* Matthias Blume

| I was under the impression that you thought you already did. :-)

Wipe that moronic grin off your face, dimwit. What your retarded
impression of other people might be should not concern anybody else.
Such despicably stupid behavior should have been punished by people who
cared about you. Why have they not?

| To be frank, I do not care *one bit* about what this discussion was
| originally about.

Of course not. Moronic grins are a pretty strong indicator of impaired
mental capacity, starting with the sheer inability to take other people
seriously.

| I was merely commenting on your claim about capitalization being
| "incidental". The debate of whether or not case-sensitive identifiers in
| programming languages are Good or Evil, or which character set design use
| up more bits than others, etc., bore me.

I tried to suggest _strongly_ that you should go back to daytime TV, but
did you get it? No. How amazingly dense you must be.

Erik Naggum

unread,
Mar 25, 2002, 8:34:19 PM3/25/02
to
* Thomas Bushnell, BSG

| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values. Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

UTF-8 is in fact extremely hostile to applications that would otherwise
have dealt with ISO 8859-1. The addition of a prefix byte has some very
serious implications. UTF-8 is an inefficient and stupid format that
should never have been proposed. However, it has computational elegance
in that it is a stateless encoding. I maintain that encoding is stateful
regardless of whether it is made explicit or not. I therefore strongly
suggest that serious users of Unicode employ the compression scheme that
has been described in Unicode Technical Report #6. I recommend reading
this technical report.
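
For reference, here is a minimal sketch of the plain, uncompressed UTF-8
encoding under discussion, covering the original 1- to 6-byte scheme over
the full 31-bit space (decoding and the TR #6 compression are not shown):

(defun utf-8-encode (code)
  ;; one UCS code point in, a list of octets out
  (flet ((trail (pos) (logior #x80 (ldb (byte 6 pos) code))))
    (cond ((< code #x80)      (list code))
          ((< code #x800)     (list (logior #xC0 (ldb (byte 5 6) code))
                                    (trail 0)))
          ((< code #x10000)   (list (logior #xE0 (ldb (byte 4 12) code))
                                    (trail 6) (trail 0)))
          ((< code #x200000)  (list (logior #xF0 (ldb (byte 3 18) code))
                                    (trail 12) (trail 6) (trail 0)))
          ((< code #x4000000) (list (logior #xF8 (ldb (byte 2 24) code))
                                    (trail 18) (trail 12) (trail 6) (trail 0)))
          (t                  (list (logior #xFC (ldb (byte 1 30) code))
                                    (trail 24) (trail 18) (trail 12)
                                    (trail 6) (trail 0))))))

;; (utf-8-encode #xE6) => (195 166), i.e. #xC3 #xA6: the æ that was one
;; octet in 8859-1 now needs a prefix byte, which is the hostility noted above.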

Incidentally, if I could design things all over again, I would most
probably have used a pure 16-bit character set from the get-go. None of
this annoying 7- or 8-bit stuff. Well, actually, I would have opted for
more than 16-bit units -- it is way too small. I think I would have
wanted the smallest storage unit of a computer to be 20 bits wide. That
would have allowed addressing of 4G of today's bytes with only 20 bits.
But I digress...

| So even if strings are "compressed" this way, they are not UTF-8. That's
| Right Out. They are just direct UCS values. Procedures like string-set!
| therefore might have to inflate (and thus copy) the entire string if a
| value outside the range is stored. But that's ok with me; I don't think
| it's a serious lose.

There is some value to the C/Unix concept of a string as a small stream.
Most parsing of strings needs to proceed from start to end, so there is
no point in optimizing them for direct access. However, a string would
then be different from a vector of characters. It would, conceptually,
be more like a list of characters, but with a more compact encoding, of
course. Emacs MULE, with all its horrible faults, has taken a stream
approach to character sequences and then added direct access into it,
which has become amazingly expensive.

I believe that trying to make "string" both a stream and a vector at the
same time is futile and only leads to very serious problems. The default
representation of a string should be a stream, not a vector, and accessors
should use the stream, such as with make-string-{input,output}-stream,
with new operators like dostring, instead of trying to use the string as
a vector when it clearly is not. The character concept needs to be able
to accommodate this, too. Such pervasive changes are of course not free.
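
As a rough sketch of what such a stream-based accessor could look like in
today's Common Lisp (DOSTRING is the operator name floated above, not a
standard one; this version simply rides on ordinary string input streams):

(defmacro dostring ((var string &optional result) &body body)
  ;; traverse a string through a stream interface instead of by index
  (let ((stream (gensym "STREAM")))
    `(with-input-from-string (,stream ,string)
       (loop for ,var = (read-char ,stream nil nil)
             while ,var
             do (progn ,@body)
             finally (return ,result)))))

;; Example: counting vowels without ever indexing into the string.
;; (let ((n 0))
;;   (dostring (c "however")
;;     (when (find c "aeiou") (incf n)))
;;   n)
;; => 3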

| Ok, then the second question is about combining characters. Level 1
| support is really not appropriate here. It would be nice to support
| Level 3. But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

I agree for the streams approach. I think it is important to make sure
that there is a single code for all character sequences in the stream
when it is converted to a vector. The private use space should be used
for these things, and a mapping to and from character sequences should be
maintained such that if a private use character is queried for its
properties, those of the character sequence would be returned.

| Or is that crazy? It has the advantage of holding all the Level 3 values
| in a consistent way. (Since precombined characters do not exist for all
| possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

Correct.

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

Primarily the code point, but both, effectively, by using the private use
space as outlined above.
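
As a hedged sketch of that private-use bookkeeping (all names here are
invented; a real implementation would also answer property queries
through the reverse table):

(defparameter *next-private-code* #xE000)   ; start of the BMP private use area
(defparameter *sequence-to-private* (make-hash-table :test #'equal))
(defparameter *private-to-sequence* (make-hash-table :test #'eql))

(defun intern-combining-sequence (codes)
  ;; CODES is a list of UCS code points: the base character followed by
  ;; its combining characters, in Normalization Form D order.
  (or (gethash codes *sequence-to-private*)
      (let ((private *next-private-code*))
        (incf *next-private-code*)
        (setf (gethash codes *sequence-to-private*) private
              (gethash private *private-to-sequence*) codes)
        private)))

(defun character-sequence (code)
  ;; for a private-use code, the sequence it stands for; property
  ;; queries on such a code would be answered from this sequence
  (gethash code *private-to-sequence* (list code)))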

Erik Naggum

unread,
Mar 25, 2002, 8:53:11 PM3/25/02
to
* Michael Parker

| OTOH, if terminals had gotten color and typefaces earlier, maybe
| programming languages would have evolved to use them.

Only if we had also had a stateless coding for them, statefulness being
so frightening to the kinds of programmers who are likely to invent new
syntaxes.

| Maybe give each namespace its own color, so you would specify the value
| of a name by putting it in blue, the function by using red, keywords in
| italics, macros in green. The mind boggles at the possibilities.

Especially if they also used XML to write it all, and then we can use
cascading style sheets to control both background and foreground color.
And programmers would have to be selected from those who are not color
blind. This is unlikely to succeed, since the current selection from
those who can spell has not been successful, either, and that is at least
something you can learn.

Thanks for the URL, though. My mind boggles at statements like these:
"With the huge RAM of modern computers, an operating system is no longer
necessary."

Erik Naggum

unread,
Mar 25, 2002, 9:21:25 PM3/25/02
to
* Matthias Blume

| The two words are pronounced very differently.

But so are house and house, distinguished by a voiced and unvoiced s.
Some languages also have tonemes, not just phonemes. Norwegian is among
them. The phonemes of the Norwegian words for "farmers", "prayers" and
"beans" are the same, but the tonemes differ. Immigrants often have
farmers for dinner and purchase produce directly from beans as a result.
The word for "farmers" is spelled "bønder" but "beans" and "prayers" are
both spelled "bønner". Note that this is not a question of stress. All
three stress the first syllable exactly the same, and do not stress the
final syllable.

| Anyway, this whole debate is supremely silly, IMHO.

Then you are supremely silly who continue to post your drivel to it.

| Fortunately neither you nor Erik get to dictate the rules, at least not
| for those languages that I speak or program in...

Of course, you are a Scheme freak and a tourist in comp.lang.lisp, the
very canonicalization of the irresponsible trouble-maker who thinks he is
an outsider to the community he torments with "you are silly who do it
differently from me" attitudes. Thank you for contributing to the
_impression_ that Scheme is the language of choice of deranged lunatics.

Thomas Bushnell, BSG

unread,
Mar 25, 2002, 9:25:27 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> Some languages also have tonemes, not just phonemes. Norwegian is among
> them. The phonemes of the Noreegian words for "farmers", "prayers" and
> "beans" are the same, but the tonemes differ. Immigrants often have
> farmers for dinner and purchase produce directly from beans as a result.
> The word for "farmers" is spelled "bønder" but "beans" and "prayers" are
> both spelled "bønner". Note that this is not a question of stress. All
> three stress the first syllable exactly the same, and do not stress the
> final syllable.

Huh? If they are different words, then *by the definition of a
phoneme* the sound which distinguishes them is a phoneme. What is a
"toneme"?

Erik Naggum

unread,
Mar 25, 2002, 9:43:54 PM3/25/02
to
* Matthias Blume

| Sorry, I was unreasonably harsh on you, Kent.

You are a clever little asshole, aren't you?

| By the way, here is an example in a case-sensitive natural language where
| the distinction between uppercase and lowercase gets *pronounced*: "mit"
| vs. "MIT" in German. The first means "with" and is pronounced like
| "mitt", the second is the Massachussetts Institute of Technology and is
| pronounced like speakers of English would pronounce it: em-ay-tee.

Geez, dude, you are _so_ full of yourself. No wonder you think this is
supremely silly -- your own contributions are ludicrous and stupid.

Whether the M, I, and T of the words that make up "MIT" are capitalized
or not is incidental. That one chooses to uppercase initials of words
is precisely what I am talking about. Sheesh, some people.

| I think that there are enough examples of this around so that making a
| distinction between uppercase and lowercase is warranted in the natural
| language case.

Hello? Of course there is a _distinction_, you incredibly retarded jerk!
Have you been arguing for a _distinction_? Man, how can you survive
being so goddamn _stupid_? Nobody has argued against a distinction, you
insufferably arrogant moron. The point is how it should be REPRESENTED!
(Incidental capitalization added purely for effect.) Is it even possible
to be so unintelligent that this is not something you could have avoided
by _thinking_ a little? Of course, you are in this "you guys are silly"
mode, so thinking on your own is out of the question, but the whole point
is that you are so unconscious and so unwilling to engage your brain to
understand what somebody else argues that you effectively reduce the
discussion to your pathetically ignorant level. Of _course_ there is a
distinction! Geez, you are such an idiot. The question is: should that
visible distinction have been coded to represent the incidental quality
apart from the intrinsic quality, and the answer is so "advanced" that
your puny little brain will in all likelihood not grasp its simplicity.

Let me give your severely reduced mental capacity a simple enough example
that you might actually be inspired to think about the ramifications.
The symbol for Ångstrøm in Unicode is exactly the same as the glyph for
the letter A with ring above, because the guy's name was spelled with
that letter, just like Celsius and Fahrenheit, but all these three
letters should never be lowercased even though they are upper-case
letters. This is an intrinsic quality. For this reason, Unicode has
chosen to represent them as _symbols_, not letters. The same applies to
Greek omega, pi, rho, and sigma, which are different symbols in each
case. Can you wrap your exceptionally pitiful brain around these few and
simple examples to perhaps grasp that incidental qualities and intrinsic
qualities are important? Or are you so unphilosophical and such a
leering idiot with a moronic grin permanently attached to his skull that
being able to grasp what other people have thought about before you has
become impossible for you?

No wonder you think those who think are _gods_ in their own mind: If you
had been able to think at all, you would probably experience _several_
revelations of such magnitude that one "god" would not be enough.

| Again, I do not think that this needs to be in any way correlated with
| the PL case.

Is the stuff you are smoking legal? Go back to your Scheme community,
where being supremely silly is not considered rude to your compatriots.

Matthias Blume

unread,
Mar 25, 2002, 10:33:47 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> Some languages also have tonemes, not just phonemes. Norwegian is among
> them. The phonemes of the Noreegian words for "farmers", "prayers" and
> "beans" are the same, but the tonemes differ. Immigrants often have
> farmers for dinner and purchase produce directly from beans as a result.
> The word for "farmers" is spelled "bønder" but "beans" and "prayers" are
> both spelled "bønner". Note that this is not a question of stress. All
> three stress the first syllable exactly the same, and do not stress the
> final syllable.

So what? What does this have to do with anything? I have already
pointed out examples (albeit not from Norwegian, which I don't know at
all) for this phenomenon. Pronunciation and spelling are often at
odds. Therefore, one cannot argue on the basis of phonetics which
visual distinctions in the written language matter and which ones
don't. As far as I am concerned, uppercase and lowercase are not the
same. In German, this is simply a fact of how the written language is
defined. Getting the capitalization wrong is a spelling error just
like using the wrong vowel, missing an 'h' somewhere, using 'ss' where
'ß' should be used, joining words where they ought to be separated and
vice versa, and so and and so forth. Of course, many of these
distinctions are redundant to some degree. Case distinctions are not
the only redundancies. Should we abolish all whitespace just because
with some practice one can infer where word boundaries are? I haven't
seen anyone suggesting this. (And again, there are precedents for
such a things, for example in some far eastern languages where words
are not visibly separated in the written language.)

> OF course, you are a Scheme freak and a tourist in comp.lang.lisp, the
> very canonicalization of the irresponsible trouble-maker who thinks he is
> an outsider to the community he torments with "you are silly who do it
> differently from me" attitudes. Thank you for contributing to the
> _impression_ that Scheme is the language of choice of deranged lunatics.

Quite funny that you think I am a Scheme person...
(Especially considering that Scheme, like CL, uses case-insensitive identifiers.)
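
(For the record, the case-insensitivity in Common Lisp is only the
reader's default case conversion, not a property of symbols; a quick,
standard illustration:)

(eq 'foo 'FOO)       ; => T, the reader upcases both names
(symbol-name 'foo)   ; => "FOO"
(eq '|foo| 'foo)     ; => NIL, symbols themselves are case-sensitive

;; (setf (readtable-case *readtable*) :preserve)
;; -- the conversion is a readtable setting, nothing more.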

Matthias

Christopher Browne

unread,
Mar 25, 2002, 10:30:56 PM3/25/02
to
The world rejoiced as Erik Naggum <er...@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
> | Unicode values. Part of the reason it does this is so that existing byte
> | streams of Latin-1 characters can (pretty much) be used without
> | modification, and it allows "soft conversion" of existing code, which is
> | quite easy and thus helps everybody switch.
>
> UTF-8 is in fact extremely hostile to applications that would otherwise
> have dealt with ISO 8859-1. The addition of a prefix byte has some very
> serious implications. UTF-8 is an inefficient and stupid format that
> should never have been proposed. However, it has computational elegance
> in that it is a stateless encoding. I maintain that encoding is stateful
> regardless of whether it is made explicit or not. I therefore strongly
> suggest that serious users of Unicode employ the compression scheme that
> has been described in Unicode Technical Report #6. I recommend reading
> this technical report.
>
> Incidentally, if I could design things all over again, I would most
> probably have used a pure 16-bit character set from the get-go. None of
> this annoying 7- or 8-bit stuff. Well, actually, I would have opted for
> more than 16-bit units -- it is way too small. I think I would have
> wanted the smallest storage unit of a computer to be 20 bits wide. That
> would have allowed addressing of 4G of today's bytes with only 20 bits.
> But I digress...

You should have a chat with Charles Moore, of Forth fame. He
designed, using a CAD system he wrote in Forth, called OK, a 20 bit
microprocessor that (surprise, surprise... NOT!) has an instruction
set designed specifically for Forth.

Something that is unfortunate is that the 36 bit processors basically
died off in favor of 32 bit ones. Which means we have great gobs of
algorithms that assume 32 bit word sizes, with the only leap anyone
can conceive of being to 64 bits, and meaning that if you need a tag
bit or two for this or that, 32 bit operations wind up Sucking Bad.

But I digress, too...
--
(concatenate 'string "cbbrowne" "@ntlug.org")
http://www.ntlug.org/~cbbrowne/oses.html
Rules of the Evil Overlord #230. "I will not procrastinate regarding
any ritual granting immortality." <http://www.eviloverlord.com/>

Christopher Browne

unread,
Mar 25, 2002, 10:43:15 PM3/25/02
to
In the last exciting episode, Erik Naggum <er...@naggum.net> wrote::

> * Michael Parker
> | OTOH, if terminals had gotten color and typefaces earlier, maybe
> | programming languages would have evolved to use them.
>
> Only if we had also had a stateless coding for them, statefulness being
> so frightening to the kinds of programmers who are likely to invent new
> syntaxes.
>
> | Maybe give each namespace its own color, so you would specify the value
> | of a name by putting it in blue, the function by using red, keywords in
> | italics, macros in green. The mind boggles at the possibilities.
>
> Especially if they also used XML to write it all, and then we can use
> cascading style sheets to control both background and foreground color.
> And programmers would have to be selected from those who are not color
> blind. This is unlikely to succeed, since the current selection from
> those who can spell has not been successful, either, and that is at least
> something you can learn.
>
> Thanks for the URL, though. My mind boggles at statements like these:
> "With the huge RAM of modern computers, an operating system is no longer
> necessary."

Yes, that seems rather a strange comment.

Note that one of Moore's more-publicized quasi-recent projects
involved building a CAD system for designing microprocessors.

His approach was to basically write the application-cum-operating
system based on a tiny kernel of Forth instructions, which
meant he started with 80486 assembler, and built on top of that.

Apparently it offered vast opportunities to avoid all kinds of cruft
that tends to get built into CAD systems, but what it really amounted
to was that he built his system as an embedded system on top of bare
Intel metal.

I think a lot of his argument is that people keep building cruft on
top of cruft, when they might be better off with a _good_ embedded
system.

Consider the horrors of MS Office: We might be better off if, instead
of continually being mandated by the latest bloatware upgrade to
upgrade their system to the latest "Pentium IV with more memory than
anyone could _conceive_ of ten years ago," people bought cheap
electronic typewriters with bare bits of computing power.

If people spent their time _typing_, instead of trying to figure out
which menu allows them to change some bit of formatting, they might
get more work done. Consider that back in the old days, Unix used to
run in 128K words of memory, and CP/M machines could handle word
processing, spreadsheets, and databases in 56K of RAM. The notion
that you need 256MB of RAM to realistically run Windows XP should be
offensive.

In any case, Moore is a fascinating character. He is perhaps not
always to be taken seriously, but he's had more inspired ideas than
most people ever learn about...
--
(reverse (concatenate 'string "gro.mca@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/wp.html
"Cars move huge weights at high speeds by controlling violent
explosions many times a second. ...car analogies are always fatal..."
-- <west...@my-dejanews.com>

Erik Naggum

unread,
Mar 25, 2002, 11:11:32 PM3/25/02
to
* Thomas Bushnell, BSG

| Huh? If they are different words, then *by the definition of a phoneme*
| the sound which distinguishes them is a phoneme. What is a "toneme"?

Stress is generally not considered to be a difference in phoneme.

The sound is exactly the same, but whether you have entering, departing,
rising, falling, high, low, up-down, down-up, or level tone can and does
change the meaning of the word. Thai, for instance, has explicit tone
markers. Chinese has different ideographs for words that are pronounced
with the same phonemes and different tonemes.

Consider the phonemes of the word "really". The toneme is the difference
in pronunciation between "Really?" and "Really." and "Really!".

French, for instance, has no stress, but tends to use marginally shorter
and longer vowels. They also have no tonemes, so the French have very
_serious_ problems dealing with other languages and sound ridiculous in
almost every other language than their own.

Erik Naggum

unread,
Mar 25, 2002, 11:20:45 PM3/25/02
to
* Matthias Blume

| So what? What does this have to do with anything?

Why are you still talking? This is "supremely silly" and you keep
blabbering? What for?

| As far as I am concerned, uppercase and lowercase are not the same.

Nobody has said they are. Please just grasp this, OK? That some
distinction is incidental does mean that it is not there. I wonder what
your limited brainpower has concluded that this discussion is all about
when you are so devoid of understanding. Geez, you are _so_ stupid.

Thomas Bushnell, BSG

unread,
Mar 25, 2002, 11:32:20 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> * Thomas Bushnell, BSG
> | Huh? If they are different words, then *by the definition of a phoneme*
> | the sound which distinguishes them is a phoneme. What is a "toneme"?
>
> Stress is generally not considered to be a difference in phoneme.

Oh, ok. That's a good point; the term "phoneme" is ambiguous I think.
Tonal differences are sometimes phonemic and sometimes not, but I now
understand what you mean. Whether a tonal or length difference should
be officially phonemic is a matter of style and not any real linguistics,
as far as I can tell.

> Consider the phonemes of the word "really". The toneme is the difference
> in pronunciation between "Really?" and "Really." and "Really!".

Yeah, but there it's a matter of marking, which is different than
tone. A better example in English is between homographs like
"conduct" (a noun, stress on the first syllable) and "conduct" (a
verb, stress on the second syllable).

Because stress is contextual, it's not normally counted as a phoneme.
Tone and length are not contextual, so I think those are usually
counted as phonemes. But (as I said above) I think this is a pretty
gray area.

> French, for instance, has no stress, but tends to use marginally shorter
> and longer vowels. They also have no tonemes, so the French have very
> _serious_ problems dealing with other languages and sound ridiculous in
> almost every other language than their own.

Actually French does have stress as a word marker; the last syllable
of each word gets a stress. (Obviously, stress is therefore not
phonemic in French.)

Thomas

Matthias Blume

unread,
Mar 25, 2002, 11:46:34 PM3/25/02
to
Erik Naggum <er...@naggum.net> writes:

> | As far as I am concerned, uppercase and lowercase are not the same.
>
> Nobody has said they are. Please just grasp this, OK? That some
> distinction is incidental does not mean that it is not there.

I meant: they are intrinsically not the same.

cr88192

unread,
Mar 25, 2002, 9:17:00 PM3/25/02
to
>
> Something that is unfortunate is that the 36 bit processors basically
> died off in favor of 32 bit ones. Which means we have great gobs of
> algorithms that assume 32 bit word sizes, with the only leap anyone
> can conceive of being to 64 bits, and meaning that if you need a tag
> bit or two for this or that, 32 bit operations wind up Sucking Bad.
>
hello, personally I don't really know what the big difference is...
I would have imagined that in any case a slightly larger word size would
have been useful, but it is not...
sometimes for some of my code I use 48 bit ints (when 32 bits is too small
and 64 is overkill). I would think that with 36 bits the next size up would
be 72, and 36 is not evenly divisible by 8 so you would need a different
byte size as well (ie: 9 or 12).
sorry, I don't really know of byte sizes other than 8...
am I missing something?

(little has changed in my life since before, except that I am working on an
os now... again...).

Erik Naggum

unread,
Mar 26, 2002, 1:06:55 AM3/26/02
to
* cr88192 <cr8...@hotmail.com>

| sorry, I don't really know of byte sizes other than 8...
| am I missing something?

Yes. A "byte" is only a contiguous sequence of bits in a machine word,
and has been used that way by most vendors, for us notably DEC, which
contributed the machine instructions we know as LDB and DPB and the
notion of a byte specifier, which has bit position in word and length in
bits. Failure to support LDB and DPB in hardware is very costly for a
large number of useful operations, but in a byte-addressable world
with 8-bit bytes, using anything smaller than bytes that might cross byte
boundaries has serious penalties. In a word-addressable world, this
saves a lot of memory, even relative to the byte-addressable machines. C
has bit fields because it was intended to run on the Honeywell 6000, which
had 36-bit words, so its "char" was 9 bits wide. (See page 34 of
Kernighan & Ritchie, 1st ed.)
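
For readers who have not met them: LDB and DPB are the standard Common
Lisp operators of the same names, shown here (nothing invented beyond the
example field) on a 36-bit-style packing of 9-bit characters:

(defparameter *third-char* (byte 9 18))   ; a 9-bit field starting at bit 18

(defun pack-char (word char)
  ;; DPB deposits the character code into that field of WORD
  (dpb (char-code char) *third-char* word))

(defun unpack-char (word)
  ;; LDB loads the field back out; without hardware support each of
  ;; these turns into shift-and-mask sequences
  (code-char (ldb *third-char* word)))

;; (unpack-char (pack-char 0 #\A)) => #\A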

IBM chose a more specific terminology: 4-bit nybbles (the same spelling
deviation as "byte" from "bite"), 8-bit bytes, 16-bit half-words, 32-bit
words, and 64-bit double-words. On the PDP-10, we had 36-bit words,
18-bit half-words (and halfword instructions), but bytes were all over
the place. I know several people who think this is a much better design
than the stupid 8-bit design we have today. Sadly, only several, not
millions and millions who think Intel's designs are better just because
they can buy them.

Erik Naggum

unread,
Mar 26, 2002, 1:21:08 AM3/26/02
to
* Thomas Bushnell, BSG

| Oh, ok. That's a good point; the term "phoneme" is ambiguous I think.
| Tonal differences are sometimes phonemic and sometimes not, but I now
| understand what you mean. Whether a tonal or length difference should be
| officially phonemic is a matter style and not any real linguistics, as
| far as I can tell.

*sigh*  My native language has tonemes. Yours does not. Trust me on
this, OK? Go look it up if you doubt me.

Tone is the musical tone with which you pronounce a phoneme, or more
precisely, the relative direction of the change of the tone
throughout the word.

> Consider the phonemes of the word "really". The toneme is the difference
> in pronunciation between "Really?" and "Really." and "Really!".

| Yeah, but there it's a matter of marking, which is different than tone.

*sigh*  No, this is a tone difference. The rising tone at the end of a
question is precisely this -- tone. One does not usually talk about
tonemes when dealing with the changing meaning of a sentence, but it is
the same idea.

| A better example in English is between homographs like "conduct" (a noun,
| stress on the first syllable) and "conduct" (a verb, stress on the second
| syllable).

No, that would be stress, not tone. I was trying to give you an example
of what tone is, not how the same sequence of phonemes can have different
meaning in differing ways.

| Because stress is contextual, it's not normally counted as a phoneme.
| Tone and length are not contextual, so I think those are usually counted
| as phonemes. But (as I said above) I think this is a pretty gray area.

No, it is not a grey area. It just does not apply to English. Study
Norwegian or Thai.

Erik Naggum

unread,
Mar 26, 2002, 1:22:00 AM3/26/02
to
* Matthias Blume

| I meant: they are intrinsically not the same.

Then your position is not only misguided, but utterly false, you
supremely silly man.

Florian Hars

unread,
Mar 26, 2002, 2:47:10 AM3/26/02
to
Erik Naggum wrote in article <32261124...@naggum.net>:
> * Thomas Bushnell, BSG

>| Tonal differences are sometimes phonemic and sometimes not
>
> *sigh*  My native language has tonemes. Yours does not. Trust me on
> this, OK? Go look it up if you doubt me.

Some data points on "toneme" from the web:
The American Heritage® Dictionary:
A type of phoneme
The Concise Oxford Dictionary of Linguistics:
A unit of pitch, especially in tone languages, treated as or
analogously to a phoneme.
http://www.factmonster.com:
a phoneme consisting of a contrastive feature of tone in a tone
language

Yours, Florian.

Thomas Bushnell, BSG

unread,
Mar 26, 2002, 3:13:37 AM3/26/02
to
Erik Naggum <er...@naggum.net> writes:

> * Thomas Bushnell, BSG
> | Oh, ok. That's a good point; the term "phoneme" is ambiguous I think.
> | Tonal differences are sometimes phonemic and sometimes not, but I now
> | understand what you mean. Whether a tonal or length difference should be
> | officially phonemic is a matter style and not any real linguistics, as
> | far as I can tell.
>
> *sigh*  My native language has tonemes. Yours does not. Trust me on
> this, OK? Go look it up if you doubt me.

I'm trusting you about the way Norwegian works, and I'm trying to
understand it in the terminology used in English to speak about
linguistics.

I do understand perfectly well what tone is.

> | Because stress is contextual, it's not normally counted as a phoneme.
> | Tone and length are not contextual, so I think those are usually counted
> | as phonemes. But (as I said above) I think this is a pretty gray area.
>
> No, it is not a grey area. It just does not apply to English. Study
> Norwegian or Thai.

I know perfectly well what tone is.

The question is whether tonal difference is a phonemic difference.

Since a phoneme is a minimal unit distinguishing two words, if there
are two words that differ only in tone, the difference must therefore
be phonemic.

I mentioned stress (in English, with the "conduct" example), because
stress is also sometimes thought not to distinguish phonemes, but
really it does.

What is a gray area is how rigid one wants to be about the
definition of "phoneme".

Thomas

Alain Picard

unread,
Mar 26, 2002, 3:29:38 AM3/26/02
to
Erik Naggum <er...@naggum.net> writes:

>
> French, for instance, has no stress, but tends to use marginally shorter
> and longer vowels. They also have no tonemes, so the French have very
> _serious_ problems dealing with other languages and sound ridiculous in
> almost every other language than their own.

What makes you think they don't sound equally ridiculous in French? ;-)

In high school, I never did understand what the English teacher was
going on about, with his "iambic pentameter" stuff. If you come from
a monotonic language, the whole thing doesn't make a lot of sense.
Oh well, _our_ rhymes are a lot more exact.

*Years* later, having married an anglophone and lived in English
society for a few years, it was finally explained to me that English
has this "stress" thing... my accent improved markedly after that.

--
It would be difficult to construe this as a feature.
        -- Larry Wall, in article <1995May29....@netlabs.com>

Erik Naggum

unread,
Mar 26, 2002, 4:18:22 AM3/26/02
to
* Thomas Bushnell, BSG

| Since a phoneme is a minimal unit distinguishing two words, if there are
| two words that differ only in tone, the difference must therefore be
| phonemic.

Apparently, this is how some people see it -- I have not seen a
difference in tone referred to as "phonemic". However, phonemes are
supposed to be discrete elements of speech. A toneme is not -- the change
in tone usually spans several phonemes. Therefore, it is either a
phoneme of its own, which seems odd, or an additional speech element.
If a "phoneme" is the _only_ smallest unit of sound, it no longer appears
possible to enumerate the phonemes of a language.

| I mentioned stress (in English, with the "conduct" example), because
| stress is also sometimes thought not to distinguish phonemes, but
| really it does.

So when something, anything distinguishes phonemes, they become two?
That does not appear to be useful. It seems rather to multiply them
without bounds.

| What is a gray area is how rigid one wants to be about the
| definition of "phoneme".

It seems that if you can put whatever you want into it, it is rendered useless.

Nils Goesche

unread,
Mar 26, 2002, 4:27:15 AM3/26/02
to
In article <87663kp...@orion.bln.pmsf.de>, Pierre R. Mai wrote:
> Matthias Blume <matt...@shimizu-blume.com> writes:
>
>> By the way, here is an example in a case-sensitive natural language
>> where the distinction between uppercase and lowercase gets
>> *pronounced*: "mit" vs. "MIT" in German. The first means "with" and is
>> pronounced like "mitt", the second is the Massachusetts Institute of
>> Technology and is pronounced like speakers of English would pronounce
>> it: em-ay-tee. I think that there are enough examples of this around
>
> This is "supremely silly", if there is such a thing, even ignoring for
> the time that MIT is neither a german word, nor a german abbreviation,
> and that probably a large number of german speakers will not recognize
> MIT as standing for "the" MIT, nor pronounce it as speakers of English
> would. The different pronounciation of mit vs. MIT doesn't result
> from the difference in case, at all. If you receive a telex that
> informs you of an invitation to "the mit", you will pronounce "mit"
> just as you would "MIT". qed.

I agree that the MIT example is silly, but there are much better ones.
Compare

``Der Philosoph fuehlt sich im allgemeinen wohl.''
(roughly: the philosopher generally feels fine)

with

``Der Philosoph fuehlt sich im Allgemeinen wohl.''
(roughly: the philosopher feels at home in the realm of the general)

In speech, you can tell the difference because in the latter case
the main accent is on ``Allgemeinen'', whereas in the former it
is on ``wohl''. Incidentally, the totally moronic ``spelling
reform'' that happened a few years ago breaks this example,
like numerous others, but fortunately at least my favorite
newspaper continues to use the old spelling.

Regards,
--
Nils Goesche
"Don't ask for whom the <CTRL-G> tolls."

PGP key ID 0x42B32FC9

Thomas A. Russ

unread,
Mar 25, 2002, 5:39:22 PM3/25/02
to
Ed L Cashin <eca...@uga.edu> writes:
> I must admit
> that when I first found out that current lisps have case-insensitive
> symbol names, I thought it reminiscent of BASIC -- kind of a throwback
> to a time when memory was much more at a premium. (I know that Lisp
> predates BASIC. I'm talking about my reaction.) I'd be happy to hear
> a good case for case-insensitive identifiers.

Point of Information:

Lisp does not have case-insensitive symbol names. They are most
certainly case-sensitive. It is just that the default setting of the
input reader makes it inconvenient to use mixed-case identifiers, since
you need to escape either the lower-case characters

f\o\o gives the symbol named "Foo"

or the entire symbol name

|Foo| also gives the symbol named "Foo"

The default behavior of the reading process is to map all (non-escaped)
characters to uppercase.

There are ways around this, such as setting the readtable case to
:PRESERVE, which as you might suspect, preserves the input case. With
that setting one could type Foo and get the symbol named "Foo". But all
of the built-in Common Lisp symbols are defined to be in uppercase, so
that would mean having to type the built-in symbols all in uppercase.

It so happens that there is a very clever way around this, with
readtable case :INVERT, which inverts the case of all identifiers which
use either only lowercase or only uppercase, but preserves the case of
mixed case identifiers. This probably gives you the best of both
worlds.
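
A minimal sketch of how this plays out at the reader level (illustrative
only, not part of the original post; it assumes any conforming Common
Lisp implementation):

  (symbol-name 'f\o\o)   ; => "Foo"   escaped characters keep their case
  (symbol-name '|Foo|)   ; => "Foo"   the whole name is escaped
  (symbol-name 'foo)     ; => "FOO"   default READTABLE-CASE is :UPCASE

  ;; With READTABLE-CASE set to :INVERT, single-case names are flipped
  ;; and mixed-case names are left alone.
  (let ((*readtable* (copy-readtable)))
    (setf (readtable-case *readtable*) :invert)
    (list (symbol-name (read-from-string "foo"))    ; => "FOO"
          (symbol-name (read-from-string "FOO"))    ; => "foo"
          (symbol-name (read-from-string "Foo"))))  ; => "Foo"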

[Aside: Kent or anyone: Who came up with the idea for the :INVERT
readtable case? It seems rather clever, even if in a slightly demented
sort of way.]

-Tom.

--
Thomas A. Russ, USC/Information Sciences Institute t...@isi.edu

Nils Goesche

unread,
Mar 26, 2002, 4:46:41 AM3/26/02
to
In article <a7o7eu$i8a$1...@news.gte.com>, Dorai Sitaram wrote:
> In article <m3zo0wu...@elgin.eder.de>,
> Andreas Eder <Andrea...@t-online.de> wrote:

>>Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with a
>>long 'e' and the other with a short one - that is because they are
>>different words. Should you incidentally start a sentence with 'weg',
>>thus writing it with capital 'W' it would still be pronounced like
>>'weg'. This might be difficult to understand, but that is how natural
>>languages are, I guess.
>
> To me, that case is indeed ornamental is supported by
> the fact that it appears to be permissible to
> upper-case a German sentence in its entirety
> without construing it as a loss of information.

This is true for /some/ sentences, but not /all/ sentences (I posted
an example). In the seventies, it was popular among radical leftists
to write everything in lowercase. The slogan was: ``Wer groszschreibt
ist auch fuer's Groszkapital'', freely translated something like
``Friends of capitalization are also friends of the capital'', or
some such. It is /significantly/ harder to read a German text
without proper capitalization.

Matthias Blume

unread,
Mar 26, 2002, 7:09:52 AM3/26/02
to
Erik Naggum <er...@naggum.net> writes:

> * Matthias Blume
> | I meant: they are intrinsically not the same.
>

> Then your position is not only misguided, but utterly false, [ ... ]

I do understand what you are trying to get at with your distinction
between "intrinsic" and "incidental". However, I think that this very
distinction itself is in the "incidental" category. It is absolutely
not clear where to draw the line between "intrinsic" features of
spelling and "incidental" ones. Why single out the sentence-initial
capitalization rule? Why not also get rid of repeated consonants? At
least in some languages, the rules about those are just as
"incidental" as the capitalization rules.

What you are trying to do is separate content and form. I wish you
good luck in this endeavor, but also predict that it is doomed from
the beginning. If you could actually do it, that would be great: We
could store the intrinsic parts of German text and then "render" it
according to the spelling and grammar rules of the day. (There
recently has been a big official -- and very controversial -- reform
of the spelling rules in German. They attack precisely some of those
"incidental" aspects, but strangly, leave others (such as
sentence-initial capitalization) untouched.)

By the way, I assume that the abuse that you heap on people when you
get into one of your famous tirades is "incidental"...

Matthias

Erik Naggum

unread,
Mar 26, 2002, 8:29:15 AM3/26/02
to
* Matthias Blume <matt...@shimizu-blume.com>

| I do understand what you are trying to get at with you distinction
| between "intrinsic" and "incidental". However, I think that this very
| distinction itself is in the "incidental" category.

Up-casing all letters in a heading is clearly incidental. Capitalization
of each non-preposition in a title is clearly incidental. Capitalization
of the sentence-initial word is clearly incidental. Capitalization of
proper names is perhaps intrinsic, in which case information should not
be lost when you write "Smith said ..." and later change it to "After a
brief pause, Smith said ...", should be recoverable from titles and
headlines, and should therefore be regarded as information that
incidental capitalization actually _destroys_. If you think intrinsic
capitalization is so important, you would have objected to the incidental
capitalization or upcasing of words because of their information loss.
You do not, so I conclude that you are completely _unconcerned_ with this
loss of information from incidental capitalization, and therefore do
_not_ regard intrinsic capitalization as important.

| It is absolutely not clear where to draw the line between "intrinsic"
| features of spelling and "incidental" ones.

It appears that you think that intrinsic-p = (complement incidental-p).
This is unwarranted, and most of your argumentation just falls to pieces
because you believe this and argue against a negative.

| Why single out the sentence-initial capitalization rule? Why not also
| get rid of repeated consonants?

What the fuck are you talking about? Geez, are you for real?

| At least in some languages, the rules about those are just as
| "incidental" as the capitalization rules.

What _are_ you unable to deal with?

| What you are trying to do is separate content and form. I wish you good
| luck in this endeavor, but also predict that it is doomed from the
| beginning.

Look, you are so stupid that this is getting seriously boring: The whole
context of the discussion is what if we could design things all over?
Your insipid complaints and your moronic attitude problems are hostile.

| By the way, I assume that the abuse that you heap on people when you get
| into one of your famous tirades is "incidental"...

In your case, stupidity and hostility seem to be intrinsic. Just THINK,
and you will find a nicer side of me. Be an annoying asshole, and you
find me unpleasant. It really is that simple. Some people _are_ no more
than annoying assholes and think it is my fault. This is not so, but it
sure seems to make annoying assholes happier to think it is. This is how
they remain annoying assholes.

Seth Gordon

unread,
Mar 26, 2002, 10:35:49 AM3/26/02
to
Christopher Browne wrote:
>
> Consider the horrors of MS Office: We might be better off if, instead
> of continually being mandated by the latest bloatware upgrade to
> upgrade their system to the latest "Pentium IV with more memory than
> anyone could _conceive_ of ten years ago," people bought cheap
> electronic typewriters with bare bits of computing power.
>
> If people spent their time _typing_, instead of trying to figure out
> which menu allows them to change some bit of formatting, they might
> get more work done.

You assume that people (or corporate purchasing agents) spent money on
more expensive word processors only because they wanted to get more work
done. I am not sure this is true. :-)

--
"Any fool can write code that a computer can understand.
Good programmers write code that humans can understand."
--Martin Fowler
// seth gordon // wi/mit ctr for genome research //
// se...@genome.wi.mit.edu // standard disclaimer //

Ingvar Mattsson

unread,
Mar 26, 2002, 10:36:26 AM3/26/02
to
tb+u...@becket.net (Thomas Bushnell, BSG) writes:

> Erik Naggum <er...@naggum.net> writes:
[SNIP]


> > Consider the phonemes of the word "really". The toneme is the difference
> > in pronunciation between "Really?" and "Really." and "Really!".
>
> Yeah, but there it's a matter of marking, which is different than
> tone. A better example in English is between homographs like
> "conduct" (a noun, stress on the first syllable) and "conduct" (a
> verb, stress on the second syllable).

If I remember my youth (and Norwegian workmate from then) correctly,
Norwegian is similar to Swedish in this regard and there is no
difference in stress pattern between quite a few similar-sounding-but-
obviously-different words.

//Ingvar
--
(defun m (a b) (cond ((or a b) (cons (car a) (m b (cdr a)))) (t ())))

Thomas Bushnell, BSG

unread,
Mar 26, 2002, 12:43:52 PM3/26/02
to
Erik Naggum <er...@naggum.net> writes:

> | What is a gray area is whether how rigid one wants to be about the
> | definition of "phoneme".
>
> Seems if you can put whatever you want into to, it is rendered useless.

That's one-bit thinking. It's a gray area, not a rigid definition,
and I thank you for pointing out the complexities in the case of
Norwegian.

Jochen Schmidt

unread,
Mar 26, 2002, 1:55:56 PM3/26/02
to
Kent M Pitman wrote:


> Capitalization _is_ incidental. It is ceremonially marked in written
> text, but my impression based on a basic knowledge of linguistics and
> a casual outside view of German [I don't purport to speak the
> langauge] is that German people may claim that "weg" and "Weg" are
> different words, but the capitalization is not pronounced audibly, so
> there is generally enough contextual information to disambiguate in
> speech.

In German, nouns are capitalized. "Weg" is a noun which means "way" in
English. The word "weg", on the other hand, means "away". They are pronounced
differently ("Weg" with a long "e" and "weg" with a short one) and they are of
course semantically different symbols. Capitalizing nouns in German is a
redundant thing - it is no bigger a problem than in English. In English,
too, there are examples of colliding nouns and verbs (for example "saw"), but
they are resolved by their grammatical or semantic role in the
sentence. A word at the beginning of a sentence is capitalized in German as
well (as in "Weg hier!" <-> "Away from here!").
Capitalizing nouns in written text makes the process of disambiguation
easier but raises the burden on the writer. Many people in Germany who do
not know German grammar well enough have problems with proper
capitalization when writing. And even among those who know it well enough,
many chatters on IRC write entirely without capitalization because they
can type faster that way.

ciao,
Jochen

--
http://www.dataheaven.de

Erik Naggum

unread,
Mar 26, 2002, 1:08:17 PM3/26/02
to
* Thomas Bushnell, BSG

| It's a gray area, not a rigid definition, and I thank you for pointing
| out the complexities in the case of Norwegian.

Bare hyggelig!

Nils Goesche

unread,
Mar 26, 2002, 1:16:02 PM3/26/02
to
In article <a7qcif$jhc$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt wrote:
>
> In German, nouns are capitalized. "Weg" is a noun which means "way" in
> English. The word "weg", on the other hand, means "away". They are pronounced
> differently ("Weg" with a long "e" and "weg" with a short one) and they are of
> course semantically different symbols. Capitalizing nouns in German is a
> redundant thing - it is no bigger a problem than in English.

There is lots of redundancy in both spelling and human language. If
you omitted every third vowel, I'd still understand your writings.
And there are many people who say that it /is/ a bigger problem than
in English (in English, there is /no/ problem). Possibly because of
greater freedom in the order of words in German, I don't know.

> And even among those who know it well enough, many
> chatters on IRC write entirely without capitalization because they
> can type faster that way.

They also write `n8' instead of `Good night!'. I hope you are not
proposing to eliminate all redundancy in our language :-)

Jochen Schmidt

unread,
Mar 26, 2002, 3:22:17 PM3/26/02
to
Nils Goesche wrote:

> In article <a7qcif$jhc$1...@rznews2.rrze.uni-erlangen.de>, Jochen Schmidt
> wrote:
>>
>> In German, nouns are capitalized. "Weg" is a noun which means "way"
>> in English. The word "weg", on the other hand, means "away". They are
>> pronounced differently ("Weg" with a long "e" and "weg" with a short one)
>> and they are of course semantically different symbols. Capitalizing nouns
>> in German is a redundant thing - it is no bigger a problem than in English.
>
> There is lots of redundancy in both spelling and human language. If
> you omitted every third vowel, I'd still understand your writings.

Of course. Redundancy can be a good thing for improving transmission
reliability and speeding up recognition. If you omitted every third vowel,
you would force people to search for words that fit the pattern, raising
the burden on the reader.

> And there are many people who say that it /is/ a bigger problem than
> in English (in English, there is /no/ problem). Possibly because of
> greater freedom in the order of words in German, I don't know.

This is probably true - my claim was merely a subjective one. As others
have pointed out, it is no _big_ problem to understand German texts written
in either all uppercase or all lowercase. It is certainly true that for most
Germans it is easier to read properly capitalized texts. AFAICT there are
only some cases in which missing capitalization would make German
incomprehensible.

>> And even among those who know it well enough, many
>> chatters on IRC write entirely without capitalization because they
>> can type faster that way.
>
> They also write `n8' instead of `Good night!'. I hope you are not
> proposing to eliminate all redundancy in our language :-)

No - not really ;-)
I hope it did not sound like that. Redundancy is not bad - it is inherently
important for communication. Removing all redundancy would have
catastrophic effects.

Torsten

unread,
Mar 26, 2002, 5:31:45 PM3/26/02
to
Nils Goesche <car...@cartan.de> wrote:

> Compare ``Der Philosoph fuehlt sich im allgemeinen wohl.'' with
> ``Der Philosoph fuehlt sich im Allgemeinen wohl.'' In speech,
> you can tell the difference because in the latter case the
> main accent is on ``Allgemeinen'', whereas in the former it
> is on ``wohl''. Incidentally, the totally moronic ``spelling
> reform'' that happened a few years ago breaks this example,
> like numerous others, but fortunately at least my favorite
> newspaper continues to use the old spelling.

Written Danish used to have similar capitalization rules as
German, but that was changed in a spelling reform in 1948. It
took quite some time, about twenty years in fact, before the
last newspaper had started using all the things introduced in
the reform (capitalization wasn't the only change; a new letter
was added to the alphabet as well). There was no shortage of
arguments similar to the one you presented above. Even claims to
the effect that the language had been ruined. You don't hear them
much anymore, if at all. Most people quickly found out that the
arguments put forth in defense of the old system were strawmen.
The capitalization of nouns really did turn out to be just an
incidental accident of history that served no real purpose beyond
the purely aesthetic.

--
Torsten

Torsten

unread,
Mar 26, 2002, 5:38:47 PM3/26/02
to
Thomas Bushnell, BSG <tb+u...@becket.net> wrote:

> Oh, ok. That's a good point; the term "phoneme" is ambiguous I think.
> Tonal differences are sometimes phonemic and sometimes not, but I now
> understand what you mean. Whether a tonal or length difference should
> be officially phonemic is a matter of style and not any real linguistics,
> as far as I can tell.

Tone is phonemic in Norwegian and Swedish. Tonal differences can
be used to form minimal pairs in those languages, as Erik has
already shown earlier.

--
Torsten

Nils Goesche

unread,
Mar 26, 2002, 5:45:19 PM3/26/02
to
Torsten <vi...@fraqz.archeron.dk> writes:

> The capitalization of nouns really did turn out to be just an
> incidental accident of history that served no real purpose beyond
> the purely aesthetic.

I don't know if it is of any use in Danish; I don't speak Danish.
But I have /read/ German texts that didn't use capitalization,
and it was very annoying in that it just makes it harder to guess
how a sentence is likely to end, something that is very important
in German (the verb at the end...). You don't grasp the
structure of a sentence as easily. Sure, that doesn't mean
capitalization is /necessary/, except for cases like the one I
posted before; but if it simply makes reading a bit easier, I
don't want to miss it. This has been measured, BTW.

Regards,
--
Nils Goesche
Ask not for whom the <CONTROL-G> tolls.

PGP key ID 0xC66D6E6F

Russell Senior

unread,
Mar 26, 2002, 5:36:03 PM3/26/02
to
>>>>> "Erik" == Erik Naggum <er...@naggum.net> writes:

TB> It's a gray area, not a rigid definition, and I thank you for
TB> pointing out the complexities in the case of Norwegian.

Erik> Bare hyggelig!

Erik, I have _no_ interest in seeing your bare "hyggelig",
particularly if it is a "gray area". You people disgust me! ;-)


--
Russell Senior <sen...@aracnet.com>
``The two chiefs turned to each other. Bellison uncorked a flood of
horrible profanity, which, translated, meant, `This is extremely
unusual.' ''

Thien-Thi Nguyen

unread,
Mar 26, 2002, 6:54:50 PM3/26/02
to
Erik Naggum <er...@naggum.net> writes:

> No, it is not a grey area. It just does not apply to English. Study
> Norwegian or Thai.

vietnamese is a good example to study for those not familiar w/ this kind of
language feature because the representation of the tones is explicit (in the
accents).

word play in vietnamese often involves varying these tones.

      ~       /       /       ^       \
     ca      ca      co      co      ca     (approximately, some markings
                                             omitted!  a dot also goes
                                             under one of the vowels)

(each fish has an uncle tomato.) the ~ means make your voice kind of swirly,
the / means make it go higher, \ means make it go lower and . under the vowel
means make your voice go really low (there is also ? which makes your voice go
higher in a sort of question-like way). the name of each accent uses the
accent, which gives some insight on how spelling is taught: first you say the
unemphasized constituent phonemes then you say the accent; repeat, eliding the
naming of the accent by using it. see sesame street for (weird to me) english
adaptation...

this representation was introduced by the french during colonial times and
reflects some french cultural values (rationality, consistency). viet-nam has
a very high literacy rate due to this, i've been told.

thi

Larry Clapp

unread,
Mar 26, 2002, 8:04:49 PM3/26/02
to
In article <87wuvzp...@becket.becket.net>, Thomas Bushnell, BSG wrote:
> Since a phoneme is a minimal unit distinguishing two words, if there are two
> words that differ only in tone, the difference must therefore be phonemic.

Could one classify a toneme as a subclass of phonemes? More to the point, do
linguists?

-- L

Thomas Bushnell, BSG

unread,
Mar 26, 2002, 8:27:48 PM3/26/02
to
Larry Clapp <la...@theclapp.org> writes:

I don't know; the word "toneme" isn't in any dictionary I had ready
access to when I checked. The text from which I learned what linguistics
I know covers only phonemes, and mentions tonal differences as one kind
of phonemic distinction.

Thomas

Rahul Jain

unread,
Mar 26, 2002, 10:01:34 PM3/26/02
to
Larry Clapp <la...@theclapp.org> writes:

> Could one classify a toneme as a subclass of phonemes? More to the point, do
> linguists?

I think the main problem is that a toneme can span multiple phonemes,
so a toneme cannot necessarily be described as a type of phoneme.

--
-> -/ - Rahul Jain - \- <-
-> -\ http://linux.rice.edu/~rahul -=- mailto:rj...@techie.com /- <-
-> -/ "Structure is nothing if it is all you got. Skeletons spook \- <-
-> -\ people if [they] try to walk around on their own. I really /- <-
-> -/ wonder why XML does not." -- Erik Naggum, comp.lang.lisp \- <-
|--|--------|--------------|----|-------------|------|---------|-----|-|
(c)1996-2002, All rights reserved. Disclaimer available upon request.

Kenny Tilton

unread,
Mar 26, 2002, 11:27:00 PM3/26/02
to

Ed L Cashin wrote:
>
> I'd be happy to hear
> a good case for case-insensitive identifiers.

I've done a ton of case-sensitive C and I've done a ton of code in
case-insensitive languages. I like case-insensitive much, much more.
Does that count?

A deeper reason is that it seems weird to use case to differentiate two
things. If I looked down and saw an app with two functions, say, ABLE-P
and able-p, meaning different things with the case meant to convey the
difference, I would have regrettably ungenerous thoughts about the author.
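
A quick sketch of why (illustrative only, not from the original post):
with the default Common Lisp reader, those two spellings read as the very
same symbol, so case alone cannot carry the distinction.

  (eq 'able-p 'ABLE-P)    ; => T      both spellings name the symbol ABLE-P
  (symbol-name 'Able-P)   ; => "ABLE-P"
  ;; Only with a non-default READTABLE-CASE (e.g. :PRESERVE) could the
  ;; two spellings denote distinct symbols.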

--

kenny tilton
clinisys, inc
---------------------------------------------------------------
"Harvey has overcome not only time and space but any objections."
Elwood P. Dowd

Thomas Bushnell, BSG

unread,
Mar 27, 2002, 12:49:09 AM3/27/02
to
Rahul Jain <rj...@sid-1129.sid.rice.edu> writes:

> Larry Clapp <la...@theclapp.org> writes:
>
> > Could one classify a toneme as a subclass of phonemes? More to
> > the point, do linguists?
>
> I think the main problem is that a toneme can span multiple phonemes,
> so a toneme cannot necessarily be describled as a type of phoneme.

Yeah, I think this is central to the problem.

Does a toneme in Norwegian extend past a single syllable, however? I
don't know the answer to that question.

In Classical Attic Greek, the accents marked tone as well as stress,
and were (mostly) phonemic. But the accents were always marked over a
single vowel, and so you could distinguish not only the long eta and
the short epsilon, but each could have one of three different accents;
all distinctions which are basically not phonemic in English (though
we do have all those sounds).

The tones actually extend beyond just the vowel, and affect timing and
intonation of the whole word, however. But they are assigned to the
stressed vowel only, and are counted as various phonemic variants of
that vowel.

The situation might work out similarly in Norwegian, dunno.

Thomas


Torsten

unread,
Mar 27, 2002, 9:34:47 AM3/27/02
to
Nils Goesche <n...@cartan.de> skrev:

> But I have /read/ texts that didn't use capitalization in
> German, and it was very annoying in that it just makes harder
> to guess how a sentence is likely to end, something that is

> very important in German (the verb at the end...). [...] This
> has been measured, BTW.

I hope you can see the obvious flaw in such measurements. There
is no large German-speaking group not trained to capitalize
nouns.

--
Torsten

Holger Schauer

unread,
Mar 27, 2002, 10:21:19 AM3/27/02
to
On 25 Mar 2002, Dorai Sitaram wrote:
> Andreas Eder <Andrea...@t-online.de> wrote:

>>Kent M Pitman <pit...@world.std.com> writes:
>>
>>> Capitalization _is_ incidental. It is ceremonially marked in
>>> written text, but my impression based on a basic knowledge of
>>> linguistics and a casual outside view of German [I don't purport
>>> to speak the langauge] is that German people may claim that "weg"
>>> and "Weg" are different words, but the capitalization is not
>>> pronounced audibly, so there is generally enough contextual
>>> information to disambiguate in speech.

I think you are confusing two separate issues: written and spoken
language. If we assume that language very likely did not start from
words, and that humans seem to be somehow equipped with some kind of
language »module«, it seems natural that how something is spoken and
how it is written are quite often different. As an example, consider
the normal alphabet we're using here, which certainly does not even
reflect how something is pronounced; that is why linguists came up with
the phonetic alphabet, after all. Conversely, (grasping) written
language may also have some aspects that are unique to the fact that
it is /written/. Whitespace comes to mind, which was introduced to help
identify word and sentence boundaries. That such things do matter
should be obvious to anyone who has ever read a large and poorly
typeset document. I see the matter of capitalization on the same level:
it may help you when you do have to distinguish ambiguous cases.

>>Well, in fact 'Weg' and 'weg' *are* pronounced differently, one with
>>a long 'e' and the other with a short one - that is because they are
>>different words.

But Kent is surely right in saying that indeed the capitalization is
usually not pronounced.

> To me, that case is indeed ornamental is supported by
> the fact that it appears to be permissible to
> upper-case a German sentence in its entirety
> without construing it as a loss of information.
>

> BITTE EIN BIT
> ICH BIN EIN BERLINER
> DIE MAUER MUSS WEG!

Capitalization is a tool for disambiguating ambiguous words. One tool.
If you're using an all-caps font, you can't give that information. So
you'll probably try to avoid ambiguous cases, which are rare, btw.
Others have already posted examples in which some confusion might be
avoided by looking at case. As somebody else has already posted, the
spelling reform, while fixing some irritating rules, resulted in the
introduction of many more ambiguous cases (most of them not
case-related, though).

However, why the German capitalization rules are the way they are is
beyond me (I can live with them, for sure). Since (simplifying) only
nouns are capitalized, it seems like capitalization should help the
reader get quickly to the right grammatical categorization, aiding the
parsing process. But there are only rare cases in which the fact that
something is a noun and not, say, a verb is really problematic; there
is typically enough contextual grammatical and semantic information.
Actually, I think English would be much more in need of noun
capitalization, with its often lax handling of embedded sentences ("The
horse raced past the barn fell" gives me headaches). Far more
problematic issues arise from homophones (e.g. the bank you sit on
vs. the bank you give your money to), which are not at all addressed
by noun capitalization.

Holger

--
--- http://www.coling.uni-freiburg.de/~schauer/ ---
"In Scheme, as in C, every programmer has to be a genius, but often comes
out a fool because he is so far from competent at every task required."
-- Erik Naggum in comp.lang.lisp

Erik Naggum

unread,
Mar 27, 2002, 12:15:50 PM3/27/02
to
* Erik Naggum
| Bare hyggelig!

* Russell Senior


| Erik, I have _no_ interest in seeing your bare "hyggelig",
| particularly if it is a "gray area". You people disgust me! ;-)

Ah, at last an explanation for why so many foreigners think we kind and
gentle Norwegians are so rude.
