Waider> On July 5, Kai.Gro...@CS.Uni-Dortmund.DE said:
>> On 05 Jul 2001, Daniel Pittman wrote:
>>
>> > IIRC, `ß' becomes `ss' when you change its case.
>>
>> That's right.
>>
>> kai
Waider> So do I take it then that the conversion of `ß' becomes `ss'
Waider> is the Right Thing, or what?
I think it is not. The Yugoslav lj (Unicode 01C9;LATIN SMALL LETTER
LJ) becomes LJ in the upper case (01C7;LATIN CAPITAL LETTER LJ) and Lj
in the title case (01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J);
in this way the software can correctly cycle through the cases:
ljubljana -- LJUBLJANA -- Ljubljana; similarly with the Dutch ij/IJ.
It's very strange that the Germans with their special relation to the
standards did not care to reserve a LATIN CAPITAL LETTER SHARP S for
the cases like this, where
title_case(up_case("Großjohann")) != "Großjohann"
--
Sergei
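The round trip described above can be checked in Python, whose built-in
str.upper() and str.title() apply the Unicode default case mappings.
This is only a sketch, assuming a Python 3 interpreter; the variable
names are illustrative and not from the thread.

```python
# Case round trips using Python's built-in Unicode case mappings.

# U+01C9 (lj) has distinct upper (U+01C7) and title (U+01C8) forms,
# so lower -> upper -> title keeps the digraph as a single character.
lower_lj = "\u01c9ub\u01c9ana"      # lowercase form, digraph as one code point
upper_lj = lower_lj.upper()         # every digraph becomes U+01C7
title_lj = upper_lj.title()         # first digraph becomes U+01C8

# Sharp s has no single-character uppercase in the default mapping:
# it expands to "SS", and titlecasing cannot recover the original.
name = "Großjohann"
upper_name = name.upper()
round_trip = upper_name.title()

print(title_lj.lower() == lower_lj)          # the lj cycle is lossless
print(upper_name, round_trip, round_trip == name)
```

The first comparison holds and the second does not, which is exactly the
asymmetry between lj and ß complained about above.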
Writing in all capital letters seems mostly a US habit spreading
throughout the world. Same for "title case". Germans aware of
their typographical tradition never would write anything in
all capital letters. The upper/lower case distinction carries
significant information for German readers that processing
functions must not destroy by changing capitalization. Germans
have nicer ways to make things appear b o l d in typewriter
text than ugly CAPITALIZATION. German sharp s can never
appear at the beginning of a word or syllable. People who ask
for a capital sharp s haven't understood the application
requirements and try to extend text processing functions
needed primarily for English onto German and other languages.
If you really have to capitalise a German word, it is actually
quite acceptable to leave the sharp s untouched if preservation
of information is of concern. The only exceptions that I can think
of are standardised documents like passports, where international
standards for the OCR data require names to be written in all
capitals with a very restricted character set. That's where
AE, OE, UE, SS are used in German names.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
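The AE, OE, UE, SS convention mentioned for passport OCR data can be
sketched as a simple translation table. The helper name
to_restricted_caps is hypothetical, and the table covers only the four
German letters named in the post; a real document standard defines many
more mappings.

```python
# Hypothetical helper: expand German letters for an all-capitals field
# with a restricted character set, such as a passport's OCR line.
GERMAN_EXPANSIONS = str.maketrans({
    "ä": "AE", "ö": "OE", "ü": "UE", "ß": "SS",
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
})

def to_restricted_caps(name: str) -> str:
    """Return the name in capitals using only unaccented letters."""
    return name.translate(GERMAN_EXPANSIONS).upper()

print(to_restricted_caps("Großjohann"))
print(to_restricted_caps("Müller"))
```

The mapping is lossy by design: the output for "Müller" is the same as
for "Mueller", so the original spelling cannot be recovered from it.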
Markus> Sergei Pokrovsky <p...@nbsp.nsk.su> writes:
>> It's very strange that the Germans with their special relation to
>> the standards did not care to reserve a LATIN CAPITAL LETTER
>> SHARP S for the cases like this, where
Markus> Writing in all capital letters seems mostly a US habit
Markus> spreading throughout the world. Same for "title case".
Markus> Germans aware of their typographical tradition never would
Markus> write anything in all capital letters.
Hm, I've got quite a few German books published in Germany where the
titles (the section titles, the cover and especially Schmutztitel etc)
are in all capitals. If you compile a table of contents ---
Besides, there may be a German article in a US proceedings etc.
--
Sergei
There certainly is all-uppercase German text, but maybe it's not as
common as in English, especially for longer texts.
I think the main points are:
1. It's never really safe to automatically change the case of German
text.
2. Even if you can see it sometimes, using sharp s in uppercase text
is bad typography.
3. The same goes for rendering umlauts as "ae", "oe", etc.
4. It's especially rude to treat personal names like this (in any
language).
--
Michael Piotrowski, M.A. <m...@dynalabs.de>
MP> Sergei Pokrovsky <p...@nbsp.nsk.su> writes:
>> >>>>> Markus Kuhn writes:
>>
Markus> Sergei Pokrovsky <p...@nbsp.nsk.su> writes:
>> >> It's very strange that the Germans with their special relation
>> to >> the standards did not care to reserve a LATIN CAPITAL
>> LETTER >> SHARP S for the cases like this, where
>>
Markus> Writing in all capital letters seems mostly a US habit
Markus> spreading throughout the world. Same for "title case".
Markus> Germans aware of their typographical tradition never would
Markus> write anything in all capital letters.
>> Hm, I've got quite a few German books published in Germany where
>> the titles (the section titles, the cover and especially
>> Schmutztitel etc) are in all capitals. If you compile a table of
>> contents ---
>>
>> Besides, there may be a German article in a US proceedings etc.
MP> There certainly is all-uppercase German text, but maybe it's not
MP> as common as in English, especially for longer texts.
Right. But there are special occasions, like name lists, where the
family names may be written in caps or small caps, in order to
distinguish them from the given names (which is especially important
if there are Hungarian or Japanese participants).
MP> I think the main points are:
MP> 1. It's never really safe to automatically change the case of
MP> German text.
But often programs have to uniquify some words; usually this involves
unifying the case.
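A minimal sketch of such case unification in Python, assuming that full
Unicode case folding (str.casefold()) is an acceptable comparison key;
the function name unify and the word list are illustrative only.

```python
def unify(word: str) -> str:
    """Key for caseless comparison, using full Unicode case folding."""
    return word.casefold()

words = ["Straße", "STRASSE", "strasse", "Masse", "Maße"]
unique = {unify(w) for w in words}

# Plain lowercasing does not unify the sharp-s and SS spellings.
print("Straße".lower() == "STRASSE".lower())
# Case folding does, but it also merges two different German words.
print(sorted(unique))
```

Folding makes "Straße" and "STRASSE" compare equal, but it also makes
"Maße" and "Masse" collide, so the key is suitable for matching and not
for preserving the text.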
MP> 2. Even if you can see it sometimes, using sharp s in uppercase
MP> text is bad typography.
I've seen some fonts with a "capital sharp s" whose glyph had the form
SS. That is, graphically Maße would upcase to a 4-character MASSE, and
titlecasing it would produce Maße again, unlike the 5-character MASSE,
which would titlecase to Masse. It is strange that Unicode did not
follow that.
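The invertible mapping described here can be sketched in Python under
one assumption that postdates this thread: a Unicode character database
that includes U+1E9E LATIN CAPITAL LETTER SHARP S (added in Unicode
5.1). The helper upper_keep_eszett is hypothetical; the default mapping
of ß is still the two-letter SS.

```python
CAPITAL_SHARP_S = "\u1e9e"   # LATIN CAPITAL LETTER SHARP S

def upper_keep_eszett(text: str) -> str:
    """Uppercase text, mapping ß to U+1E9E instead of expanding it to SS."""
    return text.replace("ß", CAPITAL_SHARP_S).upper()

word = "Maße"
invertible = upper_keep_eszett(word)   # 4 characters, one of them U+1E9E
default = word.upper()                 # 5 characters, default mapping

print(len(invertible), invertible.title() == word)
print(len(default), default.title())
```

With the single-character capital, titlecasing returns the original
word; with the default mapping it returns "Masse".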
MP> 3. The same goes for rendering umlauts as "ae", "oe", etc.
These normally retain their umlauts; not a similar case as there are
capital forms for Ä, Ö etc.
MP> 4. It's especially rude to treat personal names like this (in
MP> any language).
That was the reason of my first posting: there was a personal name
mutilated by an RFC822 e-mail address.
--
Sergei
Yes, but other solutions might be safer, e.g., underlining or using
boldface. Since case-mapping is language-dependent, it's hard to do
it correctly for a list of names in various languages.
> MP> I think the main points are:
>
> MP> 1. It's never really safe to automatically change the case of
> MP> German text.
>
> But often programs have to uniquify some words; usually this involves
> unifying the case.
Sure, but as always with natural language processing, the results
might not meet the expectations of the users.
> MP> 2. Even if you can see it sometimes, using sharp s in uppercase
> MP> text is bad typography.
>
> I've seen some fonts with a "capital sharp s" whose glyph had the form
> SS. That is, graphically Maße would upcase to a 4-character MASSE, and
> titlecasing it would produce Maße again, unlike the 5-character MASSE,
> which would titlecase to Masse. It is strange that Unicode did not
> follow that.
That's the approach taken by the TeX T1 encoding, and it certainly has
something to it, and I agree that it might be good to have it in
Unicode. Since TeX is in widespread use, it might actually be
possible to get it into Unicode.
But then, in Swiss-German orthography there's no sharp s at all...
> MP> 3. The same goes for rendering umlauts as "ae", "oe", etc.
>
> These normally retain their umlauts; not a similar case as there are
> capital forms for Ä, Ö etc.
Yes, this point was not related to case, but intended as an example of
language-insensitive mangling of text.
> MP> 4. It's especially rude to treat personal names like this (in
> MP> any language).
>
> That was the reason of my first posting: there was a personal name
> mutilated by an RFC822 e-mail address.
I missed that, but it's good to see that my reply was not totally
off-topic ;-)
Although my name is often misspelled by humans, I'm lucky that it's
easy to process even for legacy computer systems...
> I've seen some fonts with a "capital sharp s" whose glyph had the
> form SS. That is, graphically Maße would upcase to a 4-character
> MASSE, and titlecasing it would produce Maße again, unlike the
> 5-character MASSE, which would titlecase to Masse. It is strange
> that Unicode did not follow that.
I don't think that fonts themselves in the usual standard formats
(except maybe for recent OpenType stuff) normally "know" how to
capitalize ß; but it has been true for a long time that in Adobe
small caps Type 1 postscript fonts, the definition of the /germandbls
character is as a glyph which graphically renders as two capital
letters "S" side by side. Of course, in regular (non-small-caps)
Type 1 fonts, there traditionally hasn't been any character or glyph
defined to express capital ß.
--%!PS
10 10 scale/M{rmoveto}def/R{rlineto}def 12 45 moveto 0 5 R 4 -1 M 5.5 0 R
currentpoint 3 sub 3 90 0 arcn 0 -6 R 7.54 10.28 M 2.7067 -9.28 R -5.6333
2 setlinewidth 0 R 9.8867 8 M 7 0 R 0 -9 R -6 4 M 0 -4 R stroke showpage
% Henry Churchyard chu...@usa.net http://www.crossmyt.com/hc/
Actually, it's not. It's an example of transliteration, and is highly
language-dependent, both in the "from" and the "to" language sets. It
gets a bit confusing because both the from and the to languages are
using Latin script, but with different augmentation sets.
For example, if the source language is French, a diaeresis should be
dropped, whereas if it is German, a diaeresis becomes -e. In Swedish,
both techniques are widely used; historically the -e version was used,
but modern Swedes tend to prefer to transliterate "visually".
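The two rules just described can be sketched in Python by decomposing
each letter and then either dropping or replacing the combining
diaeresis. The function names and the language tags "fr" and "de" are
assumptions made for this illustration.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def strip_diaeresis(text: str) -> str:
    """French-style rule: drop the diaeresis, keep the base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(DIAERESIS, ""))

def append_e(text: str) -> str:
    """German-style rule: replace the diaeresis with a following e."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(DIAERESIS, "e"))

# Hypothetical dispatch table keyed by source language.
RULES = {"fr": strip_diaeresis, "de": append_e}

def transliterate(text: str, source_language: str) -> str:
    return RULES[source_language](text)

print(transliterate("Noël", "fr"))
print(transliterate("Müller", "de"))
```

The same input gives different output depending on the declared source
language, which is why the rule cannot be chosen from the characters
alone.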
This isn't really any different than the fact that my first name is
spelled "pe - te ru" (ペーテル) and not "pi - ta"
(ピータ) in Katakana, even though someone who knows
English and Japanese would think the latter transliteration would be
the appropriate one; or that the name of the last President of the
USSR was known as "Gorbachev" in English and "Gorbatjov" in Swedish.
-hpa
--
<h...@transmeta.com> at work, <h...@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt
> Followup to: <x6hewjk...@eos.mii.dynalabs.de>
> By author: Michael Piotrowski <m...@dynalabs.de>
> In newsgroup: comp.std.internat
> >
> > > MP> 3. The same goes for rendering umlauts as "ae", "oe", etc.
> > >
> > > These normally retain their umlauts; not a similar case as there are
> > > capital forms for Ä, Ö, etc.
> >
> > Yes, this point was not related to case, but intended as an example of
> > language-insensitive mangling of text.
>
> Actually, it's not. It's an example of transliteration, and is highly
> language-dependent, both in the "from" and the "to" language sets. It
> gets a bit confusing because both the from and the to languages are
> using Latin script, but with different augmentation sets.
True, the conversion of "ä" to "ae" _is_ language-dependent. What I
really meant is that such transliterations should be avoided if
possible; IMHO there's little reason for mutilating texts like this.
>In article <uu4rsk9...@nbsp.nsk.su>,
>Sergei Pokrovsky <p...@nbsp.nsk.su> wrote:
>
>> I've seen some fonts with a "capital sharp s" whose glyph had the
>> form SS. That is, graphically Maße would upcase to a 4-character
>> MASSE, and titlecasing it would produce Maße again, unlike the
>> 5-character MASSE, which would titlecase to Masse. It is strange
>> that Unicode did not follow that.
>
>I don't think that fonts themselves in the usual standard formats
>(except maybe for recent OpenType stuff) normally "know" how to
>capitalize ß; but it has been true for a long time that in Adobe
>small caps Type 1 postscript fonts, the definition of the /germandbls
>character is as a glyph which graphically renders as two capital
>letters "S" side by side. Of course, in regular (non-small-caps)
>Type 1 fonts, there traditionally hasn't been any character or glyph
>defined to express capital ß.
Adding separate unaccented capital glyph codes to fonts would be
a good way to handle invertible upper-/lowercasing for languages
such as Canadian French, where the accent marks do not appear on
capital letters.
Maybe it's time to move code specification in the direction of
simpler processing and away from purely a means of rendering
glyphs. I am sure some similar thought and effort applied to the
more graphical language glyph codes would simplify processing of
various sorts while perhaps occupying a larger set of code
points.
For example, maybe it is time to move away from ASCII based codes
and develop a Latin language code with all the accented
variations of all the letters adjacent within the alphabetical
sequence. There is no reason nowadays that, for example, the
Spanish double letters could not be assigned a single code,
displaying as two normal width letters, rather than being handled
as a sequence of two individual ASCII letters.
Thanks. Take care, Brian Inglis Calgary, Alberta, Canada
--
Brian....@CSi.com (Brian dot Inglis at SystematicSw dot ab dot ca)
fake address use address above to reply
>> In article <uu4rsk9...@nbsp.nsk.su>,
>> Sergei Pokrovsky <p...@nbsp.nsk.su> wrote:
>>> I've seen some fonts with a "capital sharp s" whose glyph had the
>>> form SS. That is, graphically Maße would upcase to a 4-character
>>> MASSE, and titlecasing it would produce Maße again, unlike the
>>> 5-character MASSE, which would titlecase to Masse. It is strange
>>> that Unicode did not follow that.
>> I don't think that fonts themselves in the usual standard formats
>> (except maybe for recent OpenType stuff) normally "know" how to
>> capitalize ß; but it has been true for a long time that in Adobe
>> small caps Type 1 postscript fonts, the definition of the
>> /germandbls character is as a glyph which graphically renders as
>> two capital letters "S" side by side. Of course, in regular
>> (non-small-caps) Type 1 fonts, there traditionally hasn't been any
>> character or glyph defined to express capital ß.
> Adding separate unaccented capital glyph codes to fonts would be a
> good way to handle invertible upper-/lowercasing for languages such
> as Canadian French, where the accent marks do not appear on capital
> letters.
Don't know much about Quebec, but when I was in France, I observed
that there were pretty much three levels:
1) Not printing diacritics on any capital letters
2) Printing diacritics on capital É (E acute) only.
3) Printing diacritics on all capital letters.
I don't think these different variants should be encoded directly into
different character sets (though the old IBM PC Code Page 437
basically represented level 2 above, due to the limited number of code
positions allocated to diacritical letters).
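The three levels can be treated as a rendering policy applied on top of
one character set, rather than as three character sets. A sketch in
Python, where french_upper and its level argument are hypothetical and
"diacritics" is simplified to mean every combining mark (so the cedilla
is stripped as well at levels 1 and 2):

```python
import unicodedata

def strip_marks(ch: str) -> str:
    """Remove combining marks from a single character."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def french_upper(text: str, level: int) -> str:
    """Uppercase text under one of the three observed conventions.

    1: no diacritics on capitals; 2: only É keeps its accent; 3: all kept.
    """
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    upper = unicodedata.normalize("NFC", text.upper())
    if level == 3:
        return upper
    keep = {"É"} if level == 2 else set()
    return "".join(ch if ch in keep else strip_marks(ch) for ch in upper)

for level in (1, 2, 3):
    print(level, french_upper("été à la plage", level))
```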
> maybe it is time to move away from ASCII based codes and develop a
> Latin language code with all the accented variations of all the
> letters adjacent within the alphabetical sequence. There is no
> reason nowadays that, for example, the Spanish double letters could
> not be assigned a single code, displaying as two normal width
> letters, rather than being handled as a sequence of two individual
> ASCII letters.
This doesn't make too much sense to me; currently, numerous languages
can all be handled with the same ISO-8859-1 Latin 1 character set --
but these languages have different alphabetization practices, so that
if each language's alphabetization practices had to be represented
directly in a character code (including digraphs such as Spanish CH
and LL), then each language would have to use a separate
language-specific character set. Doesn't sound like an advance to
me...
Not only does it not make much sense, it makes it much more difficult
to look up words. The American Llano would be something completely
different from the Spanish Llano. Also, ordering is not consistent.
How would you code (in Swedish) the v and w that are sorted as if
they are identical? Also, where would you put the a-diaeresis, which
is sorted as identical with a in German, but as a separate letter
following z in Swedish and Finnish? Also Danish sorts a-diaeresis
(when it occurs in foreign words) as identical to the ae-ligature,
which precedes the a-ring (and follows the z); in Swedish they both
occur and are also sorted after z, but in a different order.
And I can go on... For instance in Dutch three different sorting orders
are actually used, and that only because of the combination "ij". In
dictionary order it is sorted as the letter i followed by j, in
phonebook order it is sorted as identical to y, in encyclopedia order
it is sorted as between x and y (except of course in those cases where
they actually *are* two letters that for some reason come together, as
in the loan-word bijoux, but there is according to the dictionary only
*one* native word where that does occur).
--
dik t. winter, cwi, kruislaan 413, 1098 sj amsterdam, nederland, +31205924131
home: bovenover 215, 1025 jn amsterdam, nederland; http://www.cwi.nl/~dik/
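The a-diaeresis example can be sketched with two hand-written sort keys
over the same code points. Both keys are simplified assumptions for
illustration (no tie-breaking, lowercase only, and the Swedish alphabet
is taken as a to z followed by å, ä, ö); they are not a substitute for
real locale data.

```python
# German dictionary-style key: umlauts sort like their base vowels.
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

# Swedish key: å, ä, ö are separate letters after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"

def german_key(word: str) -> str:
    return word.lower().translate(GERMAN_FOLD)

def swedish_key(word: str) -> list:
    # Characters outside the alphabet get -1 and therefore sort first.
    return [SWEDISH_ALPHABET.find(ch) for ch in word.lower()]

words = ["zon", "ära", "arm"]
print(sorted(words, key=german_key))
print(sorted(words, key=swedish_key))
```

The same three strings come out in a different order under each key, so
the order has to live in the collation rules and not in the code point
values.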
Dik> In article <9inqbh$s...@moe.cc.utexas.edu> "Henry Churchyard" <chu...@usa.net> writes:
>> In article <obbqkt08d91nhc957...@4ax.com>,
>> Brian Inglis <Brian.do...@SystematicSw.ab.ca> wrote:
Dik> ...
>> > maybe it is time to move away from ASCII based codes and develop a
>> > Latin language code with all the accented variations of all the
>> > letters adjacent within the alphabetical sequence. There is no
>> > reason nowadays that, for example, the Spanish double letters could
>> > not be assigned a single code, displaying as two normal width
>> > letters, rather than being handled as a sequence of two individual
>> > ASCII letters.
>>
>> This doesn't make too much sense to me;
Dik> Not only does it not make much sense, it makes it much more difficult
Dik> to look up words. The American Llano would be something completely
Dik> different from the Spanish Llano.
This only means that the search engine must be smart enough to unify
the character representations. A PDF search seems to be capable of
expanding ligatures like fi or fl; and normally it can be told to fold
case (a very strange European whim to distinguish cases!).
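A minimal sketch of that kind of unification in Python, assuming that
compatibility normalization (NFKC) followed by case folding is an
acceptable search key; the name search_key is illustrative.

```python
import unicodedata

def search_key(text: str) -> str:
    """Expand compatibility characters such as ligatures, then fold case."""
    compat = unicodedata.normalize("NFKC", text)
    return compat.casefold()

# U+FB01 is the single-character "fi" ligature.
print(search_key("\ufb01le") == search_key("FILE"))
# U+01C9 is the single-character "lj" digraph.
print(search_key("\u01c9ub\u01c9ana") == search_key("Ljubljana"))
```

Both comparisons succeed, so a digraph or ligature encoded as one
character need not prevent a match against the two-letter spelling.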
Dik> Also, ordering is not consistent.
It cannot be consistent if various languages define different
collating sequences. The Czech _letter_ ch has to collate between h
and i:
Egypt < echo < ejhle
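The Czech ordering above can be reproduced without a dedicated code
point for ch by splitting each word into collation elements first. This
sketch assumes a reduced alphabet (unaccented letters plus ch) and
ignores case and diacritics; czech_key is a hypothetical name.

```python
# Collation elements in order: "ch" is one element between h and i.
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {element: i for i, element in enumerate(CZECH_ORDER)}

def czech_key(word: str) -> list:
    """Split a word into collation elements and return their ranks."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            key.append(RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after the whole alphabet.
            key.append(RANK.get(w[i], len(CZECH_ORDER)))
            i += 1
    return key

words = ["ejhle", "echo", "Egypt"]
print(sorted(words, key=czech_key))
print(sorted(words, key=str.lower))
```

The first line gives the Czech order, the second the plain
letter-by-letter order in which "echo" comes before "Egypt".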
Dik> How would you code (in Swedish) the v and w that are sorted as if
Dik> they are identical? Also, where would you put the a-diaeresis, which
Dik> is sorted as identical with a in German, but as a separate letter
Dik> following z in Swedish and Finnish? Also Danish sorts a-diaeresis
Dik> (when it occurs in foreign words) as identical to the ae-ligature,
Dik> which precedes the a-ring (and follows the z), in Swedish they both
Dik> occur, are sorted also after z, but in different order.
There are locales and language markup. After all, graphically A is
identical as a Greek Alpha (the original) or as its Cyrillic or Latin
imitations in these derived alphabets.
--
Sergei
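The point about the three look-alike capitals can be illustrated with
Python's unicodedata module: the glyphs may be identical, but each
script has its own code point and its own lowercase. The list name
lookalikes is illustrative.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: same shape, three code points.
lookalikes = ["A", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]

for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X}", name, "lowercase:", f"U+{ord(ch.lower()):04X}")
```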
I am well aware of the complexities of collation orders in
various languages, and was not suggesting that these be implied
by the coding in the language, rather that common shape or
pronunciation be used to determine which codes are placed
together, e.g. Spanish ch among the cees, German lower and upper
eszett among the esses.
The suggestion I was trying to make was that shifting from
fixed (or nearly fixed proportional) width US ASCII letters as a
common base to a common base code in which all accented variants of
Latin letters are available in a more visually or verbally
obvious order, including the national variants, digraphs and
ligatures that currently pose processing problems because they are
represented as multiple ASCII letters, might allow us to
think differently about the bases of character coding and their
processing requirements.
> This doesn't make too much sense to me; currently, numerous languages
> can all be handled with the same ISO-8859-1 Latin 1 character set --
> but these languages have different alphabetization practices, so that
> if each language's alphabetization practices had to be represented
> directly in a character code (including digraphs such as Spanish CH
> and LL), then each language would have to use a separate
> language-specific character set. Doesn't sound like an advance to
> me...
Are you mixing up character set and character encoding?
> Also, where would you put the a-diaeresis, which
> is sorted as identical with a in German, but as a separate letter
> following z in Swedish and Finnish?
Sometimes 'ä' is also sorted as 'ae' ...
>> This doesn't make too much sense to me; currently, numerous
>> languages can all be handled with the same ISO-8859-1 Latin 1
>> character set -- but these languages have different alphabetization
>> practices, so that if each language's alphabetization practices had
>> to be represented directly in a character code (including digraphs
>> such as Spanish CH and LL), then each language would have to use a
>> separate language-specific character set. Doesn't sound like an
>> advance to me...
> Are you mixing up character set and character encoding?
It doesn't make much difference (in this particular context); if CH and
LL are treated as separate unitary characters in Spanish, then Spanish
would have to have a different language-specific character set (and
naturally the encoding of this character set would therefore be
different from any other language's encoded character set).
--%!PS
10 10 scale/M{rmoveto}def/R{rlineto}def 12 45 moveto 0 5 R 4 -1 M 5.5 0 R
currentpoint 3 sub 3 90 0 arcn 0 -6 R 7.54 10.28 M 2.7067 -9.28 R -5.6333
2 setlinewidth 0 R 9.8867 8 M 7 0 R 0 -9 R -6 4 M 0 -4 R stroke showpage
% NEW E-MAIL: chu...@crossmyt.com http://www.crossmyt.com/hc/
It's worse than that. Spain itself has apparently abandoned the
practice of treating "ch" and "ll" as individual letters, but some
other Spanish-speaking countries have not...
-hpa
There IS the problem of consistency... Austrian postage stamps have
sometimes had Oesterreich, and sometimes O with the umlaut, which I
cannot do on this keyboard.