Case-insensitive comparison of std::strings

Ganesh

Jul 29, 2004, 6:27:48 AM
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison. Before posting this, I referred to:

http://www.freshsources.com/bjarne/ALLISON.HTM
http://www.josuttis.com/libbook/string/icstring.hpp.html

Given that case-insensitive comparison is such a common operation,
shouldn't it be made available within the C++ standard library instead of
leaving it to the programmers to re-write such commonly used
functionality?

-Ganesh

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Thorsten Ottosen

Jul 29, 2004, 11:15:11 AM
"Ganesh" <sgga...@gmail.com> wrote in message news:619f36eb.04072...@posting.google.com...

|
| Given that case-insensitive comparison is such a common operation,
| shouldn't it be made available within C++ standard library instead of
| leaving it to the programmers to re-write such commonly used
| functionality?

Yes. Keep an eye out for the next version of Boost, which will have a string library.

br

Thorsten

Thomas Maeder

Jul 29, 2004, 5:13:41 PM
sgga...@gmail.com (Ganesh) writes:

> It is a surprise to most of the "common" C++ programmers to learn that
> std::string provides no simple way of doing case-insensitive
> comparison. Before posting this, I referred to:
>
> http://www.freshsources.com/bjarne/ALLISON.HTM
> http://www.josuttis.com/libbook/string/icstring.hpp.html
>
> Given that case-insensitive comparison is such a common operation,
> shouldn't it be made available within C++ standard library instead of
> leaving it to the programmers to re-write such commonly used
> functionality?

Please give an *exact* specification of what you understand by
"case insensitive comparison of std::strings". Take into consideration
that in German, "MASSE" and "Masse" should only compare equal if they both
mean "mass", but not if they mean "measures".

Oh, and that's only in some countries, such as Germany and Austria. Here in
Switzerland, they should always compare equal.

Vinayak Raghuvamshi

Jul 29, 2004, 5:20:31 PM
sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...

> It is a surprise to most of the "common" C++ programmers to learn that
> std::string provides no simple way of doing case-insensitive
> comparison.

Well, isn't everything case sensitive in C++? So why be surprised at
strings being treated in a case-sensitive manner? :-)

STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...

Simple way of doing case-insensitive comparison?

stricmp(dest.c_str(),src.c_str());

Sorry, I know my response doesn't help much, and I wish I had a better
answer....

-Vinayak

> Before posting this, I referred to:
>
> http://www.freshsources.com/bjarne/ALLISON.HTM
> http://www.josuttis.com/libbook/string/icstring.hpp.html
>
> Given that case-insensitive comparison is such a common operation,
> shouldn't it be made available within C++ standard library instead of
> leaving it to the programmers to re-write such commonly used
> functionality?
>


Roshan

Jul 30, 2004, 5:27:09 AM
"Ganesh" <sgga...@gmail.com> wrote

> | Given that case-insensitive comparison is such a common operation,
> | shouldn't it be made available within C++ standard library instead of
> | leaving it to the programmers to re-write such commonly used
> | functionality?

std::string may not, but std::basic_string<> should allow for that, I think.
In a sense there are multiple ways to sort strings: ASCII, EBCDIC, lexicographical...

You need to define a custom character trait that implements the correct compare() function.
It _may_ be pretty easy to write a custom char trait that inherits from char_traits<char> and simply
overrides the compare() function. Then use your my_char_trait as follows:

typedef std::basic_string<char, my_char_trait> insensitive_string;
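
A minimal sketch of what such a trait might look like, in the spirit of the ci_char_traits class from GotW #29. It assumes single-byte characters and folds them with std::tolower in the current C locale. Besides compare(), it also replaces eq(), lt() and find(), since basic_string relies on those as well. The body of my_char_trait is an illustration, not Roshan's code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: compares chars after folding them with
// std::tolower in the current C locale (single-byte characters only).
struct my_char_trait : std::char_traits<char> {
    // Fold one char; the cast avoids undefined behavior for negative chars.
    static int fold(char c) {
        return std::tolower(static_cast<unsigned char>(c));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }

    // Three-way comparison of the first n chars, ignoring case.
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    // First position in [s, s + n) that matches c ignoring case, or null.
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return 0;
    }
};

typedef std::basic_string<char, my_char_trait> insensitive_string;
```

With this, insensitive_string("Hello") == insensitive_string("hELLO") holds. An insensitive_string does not convert implicitly to or from std::string, and streaming it with operator<< needs an overload for the new traits type, which is one practical cost of the traits approach.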

-Roshan

Julie

Jul 30, 2004, 5:34:49 AM
Vinayak Raghuvamshi wrote:
>
> sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
> > It is a surprise to most of the "common" C++ programmers to learn that
> > std::string provides no simple way of doing case-insensitive
> > comparison.
>
> Well, isn't everything case sensitive in C++? so why surprised at
> strings being treated in case sensitive manner? :-)
>
> STL is kind of saying "hey, strings and everything else are case
> sensitive in C++, but you can replace any of my methods with your own
> in a pluggable manner...". I think it is fair enough...
>
> Simple way of doing case-insensitive comparison?
>
> stricmp(dest.c_str(),src.c_str());
>
> Sorry, I know my response doesn't help much, and I wish I had a better
> answer....

Yes, bad answer.

C++, the *language*, is case sensitive, but strings/character arrays typically
represent real-world words, which are, by default, not case sensitive. Mixing
the two completely separate notions is flawed.

Julie

Jul 30, 2004, 5:35:19 AM
Thomas Maeder wrote:
>
> sgga...@gmail.com (Ganesh) writes:
>
> > It is a surprise to most of the "common" C++ programmers to learn that
> > std::string provides no simple way of doing case-insensitive
> > comparison. Before posting this, I referred to:
> >
> > http://www.freshsources.com/bjarne/ALLISON.HTM
> > http://www.josuttis.com/libbook/string/icstring.hpp.html
> >
> > Given that case-insensitive comparison is such a common operation,
> > shouldn't it be made available within C++ standard library instead of
> > leaving it to the programmers to re-write such commonly used
> > functionality?
>
> Please give an *exact* specification of what you understand by
> "case insensitive comparison of std::strings". Take into consideration
> that in German, "MASSE" and "Masse" should only compare equal if they both
> mean "mass", but not if they mean "measures".
>
> Oh, and that's only in some countries, such as Germany and Austria. Here in
> Switzerland, they should always compare equal.

a - lower case 'A'
A - upper case 'A'

case-insensitive comparison - a == A

Remember, this operates on _characters_, not words. It doesn't matter if MASSE
and Masse are considered different words in different countries -- in that
case, you wouldn't do a case insensitive comparison. For those countries where
case doesn't determine the word, case insensitive comparisons would be
appropriate.

All of this, however, is at the *option* of the programmer. Right now, there
isn't an intrinsic way to compare std::string in a case-insensitive way.
Having that capability would be beneficial, Boost offerings aside and
locales aside.

Zev_K

Jul 30, 2004, 5:49:42 AM
vs_ragh...@hotmail.com (Vinayak Raghuvamshi) wrote in message news:<9afa978c.04072...@posting.google.com>...

> sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
> > It is a surprise to most of the "common" C++ programmers to learn that
> > std::string provides no simple way of doing case-insensitive
> > comparison.
>
> Well, isn't everything case sensitive in C++? so why surprised at
> strings being treated in case sensitive manner? :-)
>
> STL is kind of saying "hey, strings and everything else are case
> sensitive in C++, but you can replace any of my methods with your own
> in a pluggable manner...". I think it is fair enough...
>
> Simple way of doing case-insensitive comparison?
>
> stricmp(dest.c_str(),src.c_str());
>
> Sorry, I know my response doesn't help much, and I wish I had a better
> answer....
>
> -Vinayak

When I need to do case-insensitive comparisons using strings, and I
don't want to have to resort to C methods, I usually do something to
the effect of:
s1.toLower()==s2.toLower()

However, in most cases, it just pays to store everything as either
upper or lower case, making everything simpler.
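
std::string has no toLower() member, so the expression above reads as shorthand. One way it might be spelled out with the standard library, assuming single-byte characters and the current C locale (lower_char, to_lower and equal_ignoring_case are hypothetical helper names):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: fold one char with std::tolower. The cast to
// unsigned char avoids undefined behavior for negative char values.
inline char lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Returns a lowercased copy; the argument is taken by value on purpose.
inline std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), lower_char);
    return s;
}

// Spells out the s1.toLower() == s2.toLower() idea from the post.
inline bool equal_ignoring_case(const std::string& s1, const std::string& s2) {
    return to_lower(s1) == to_lower(s2);
}
```

This makes two temporary copies per comparison, which is part of why storing everything in one case up front can be simpler.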

Francis Glassborow

Jul 30, 2004, 11:27:42 AM
In article <410987CF...@nospam.com>, Julie <ju...@nospam.com>
writes

>a - lower case 'A'
>A - upper case 'A'
>
>case-insensitive comparison - a == A
>
>Remember, this operates on _characters_, not words.

Fine, but how should accented lowercase letters compare to unaccented
uppercase ones? Please note that we use accents and other diacriticals
in British English but generally only on lowercase letters.

The idea that there is a (natural) language universal concept of case
sensitivity is simplistic. For example, how should we handle the German
double s represented by a glyph that looks like beta?

Case, along with collation order, is not a property of letters but of a
specific use of a natural language. We should not give some elevated
status to (US) English other than that it already has by effectively
being the default C and C++ locale. And in those contexts, case
sensitivity reigns.


--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

"Daniel Krügler (ne Spangenberg)"

Jul 30, 2004, 11:34:54 AM
Hello Julie.

Julie wrote:

I don't think that you can ignore locales for any proper case-insensitive
comparison which acts on general strings (and not on specially constrained
strings, which might be limited to some special code set). I can say that
because I once made the same error (and I actually **should** have known it
due to my national origin...).
Consider languages (e.g. German) which don't have a unique
character-by-character mapping (e.g. sz, which is the character ß in my code
page, and maps to ss). Additionally there exist circumstances where an umlaut
can validly be compared by a two-character representation (e.g. ü -> ue).
So I don't think that the C++ standard should provide any half-baked solution.
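
The one-to-many mapping can be made concrete with a small sketch. The helper name fold_german and the assumption of UTF-8 input are illustrative choices, not anything proposed in this thread. It lowercases ASCII letters and expands ß to "ss", so one character becomes two, which a character-by-character mapping cannot express.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, assuming UTF-8 input: lowercases ASCII letters and
// expands the two-byte sequence 0xC3 0x9F (U+00DF, sharp s) to "ss".
inline std::string fold_german(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        if (c == 0xC3 && i + 1 < in.size() &&
            static_cast<unsigned char>(in[i + 1]) == 0x9F) {
            out += "ss";
            ++i;  // skip the second byte of the sharp s
        } else if (c < 0x80) {
            out += static_cast<char>(std::tolower(c));
        } else {
            out += in[i];  // other non-ASCII bytes pass through unchanged
        }
    }
    return out;
}
```

Under this folding, "Maße" and "MASSE" both become "masse" and compare equal, which is exactly the context-dependent case raised earlier in the thread: the fold cannot tell "measures" from "mass".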

If you are limited to special character codes, you can write a quite
general solution by writing a special char_traits class and using this traits
class in the std::basic_string<> class template. Have a look at

http://www.gotw.ca/gotw/029.htm

to see what I mean in detail.

Greetings from Bremen,

Daniel

Vinayak Raghuvamshi

Jul 30, 2004, 12:01:50 PM
Julie <ju...@nospam.com> wrote in message news:<41098654...@nospam.com>...

> Vinayak Raghuvamshi wrote:
> >
> > sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
> > > It is a surprise to most of the "common" C++ programmers to learn that
> > > std::string provides no simple way of doing case-insensitive
> > > comparison.
> >
> > Well, isn't everything case sensitive in C++? so why surprised at
> > strings being treated in case sensitive manner? :-)
> >
> > STL is kind of saying "hey, strings and everything else are case
> > sensitive in C++, but you can replace any of my methods with your own
> > in a pluggable manner...". I think it is fair enough...
> >
> > Simple way of doing case-insensitive comparison?
> >
> > stricmp(dest.c_str(),src.c_str());
> >
> > Sorry, I know my response doesn't help much, and I wish I had a better
> > answer....
>
> Yes, bad answer.
>
> C++, the *language* is case sensitive, but strings/character arrays typically
> represent real-world words, which are by default, not case sensitive. Mixing
> the two completely separate notions is flawed.

Depends on what your notion of "real-world words" is...
The file systems of Most OSes are case sensitive.
Usernames/Passwords Used by Almost All systems are case sensitive.

As a developer, it actually helps to work in an environment that keeps
reminding you that the whole world is not case in-sensitive.

I agree that I could not provide a good "solution" to the original
poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
apPlIcAtIoNS arE caSe sEnSItIve....

-Vinayak

tom_usenet

Jul 30, 2004, 11:01:41 PM
On 29 Jul 2004 17:20:31 -0400, vs_ragh...@hotmail.com (Vinayak
Raghuvamshi) wrote:

>sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
>> It is a surprise to most of the "common" C++ programmers to learn that
>> std::string provides no simple way of doing case-insensitive
>> comparison.
>
>Well, isn't everything case sensitive in C++? so why surprised at
>strings being treated in case sensitive manner? :-)
>
>STL is kind of saying "hey, strings and everything else are case
>sensitive in C++, but you can replace any of my methods with your own
>in a pluggable manner...". I think it is fair enough...
>
>Simple way of doing case-insensitive comparison?
>
>stricmp(dest.c_str(),src.c_str());
>
>Sorry, I know my response doesn't help much, and I wish I had a better
>answer....

stricmp is a non-standard function - you can't use it in portable
code.
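
A portable stand-in can be written with only standard facilities. The sketch below is one possibility, not a drop-in replacement for any vendor's stricmp: compare_ignoring_case is a hypothetical name, and it assumes single-byte characters folded with std::tolower in the current C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical portable stand-in for stricmp: returns a negative value,
// zero, or a positive value, comparing the strings without regard to case.
inline int compare_ignoring_case(const std::string& a, const std::string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int ca = std::tolower(static_cast<unsigned char>(a[i]));
        const int cb = std::tolower(static_cast<unsigned char>(b[i]));
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    // Common prefix is equal: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

Taking std::string arguments directly also removes the need for the c_str() calls in the quoted line.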

Tom

Jeff Flinn

Jul 30, 2004, 11:05:01 PM

"Vinayak Raghuvamshi" <vs_ragh...@hotmail.com> wrote in message
news:9afa978c.04073...@posting.google.com...

> Julie <ju...@nospam.com> wrote in message news:<41098654...@nospam.com>...
> > Vinayak Raghuvamshi wrote:
> > >
> > > sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
> >
> > C++, the *language* is case sensitive, but strings/character arrays typically
> > represent real-world words, which are by default, not case sensitive. Mixing
> > the two completely separate notions is flawed.
>
> Depends on what your notion of "real-world words" is...
> The file systems of Most OSes are case sensitive.

Most OSes does not equate to most systems in use.

Jeff F

Julie

Jul 30, 2004, 11:13:50 PM
"Daniel Krügler (ne Spangenberg)" wrote:
> I don't think that you can ignore locales for any proper
> case-insentitive comparison which acts on general
> strings (and not of special constrained strings, which might be limited
> to some special code set). I can
> say that because I once did the same error (and I actually I **should**
> have known it due to my national
> origin...).

You are absolutely correct, locales must be taken into consideration, if and
when case-insensitive comparators are provided. My previous comment about
'locales aside' was merely to discuss the value of case-insensitive
comparisons, excluding specifics, but may have been a little too unrestricted to
really convey my comments.

> Consider languages (e.g. German) which don't have a unique
> character-by-character mapping (e.g. sz, which
> is the character ß in my code page, ands maps to ss). Additionally there
> exist circumstances where an umlaut
> can validly compared by a two-character-representation (e.g. ü -> ue).
> So I don't think, that the C++
> standard should provide any half-baked solution.

Absolutely. A locale-specific case-insensitive comparator may be far from
trivial to implement. In cases where it can't be implemented due to
locale-specific context issues, then that comparator is simply not available.
In those locales where it can be implemented, then it is provided. I don't
consider this half-baked, simply providing what _can_ be provided, rather than
an 'all or nothing' approach.

Julie

Jul 30, 2004, 11:14:20 PM
Francis Glassborow wrote:
>
> In article <410987CF...@nospam.com>, Julie <ju...@nospam.com>
> writes
> >a - lower case 'A'
> >A - upper case 'A'
> >
> >case-insensitive comparison - a == A
> >
> >Remember, this operates on _characters_, not words.
>
> Fine, but how should accented lower case letter compare to unaccented
> uppercase ones?

You tell me!

How are accented letters compared in real life situations in a locale? Apply
that model to the language and create specific comparators that operate on the
current locale.

> The idea that there is a (natural) language universal concept of case
> sensitivity is simplistic. For example, how should we handle the German
> double s represented by a glyph that looks like beta.

Please provide more on how case sensitivity is simplistic? Do German keyboards
not have a shift key? Or does the shift key not operate on the QWERTY portion
of the keyboard?

Case is all very well defined, per the keyboard -- use that model.

> Case, along with collation order is not a property of letters but of a
> specific use of a natural language. We should not give some elevated
> status to (US) English other than that it already has by effectively
> being the default C and C++ locale. And in those contexts, case
> sensitivity reigns.

I really don't know what your status comment has to do w/ anything pertaining
to this topic.

Nobody is advocating changes to the current behavior, simply adding to it to
provide support for (presumably locale-specific) case-insensitive comparisons.
If case-(in)sensitivity doesn't apply to a particular locale, then it isn't
provided. For those where it does, it is provided, available, and usable at
the discretion of the programmer.
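
For the single-byte case, a comparator that takes its rules from a locale can be sketched with the std::ctype<char> facet. The class name ci_equal is hypothetical, and the sketch still works one char at a time, so it does not address the one-to-many mappings raised elsewhere in the thread.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical comparator: folds each char with the std::ctype<char> facet
// of the locale it was constructed with (the global locale by default).
class ci_equal {
public:
    explicit ci_equal(const std::locale& loc = std::locale()) : loc_(loc) {}

    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return false;
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc_);
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (ct.tolower(a[i]) != ct.tolower(b[i])) return false;
        }
        return true;
    }

private:
    std::locale loc_;  // keeps the facet alive for the comparator's lifetime
};
```

Constructed as ci_equal(std::locale::classic()), it compares "Julie" and "JULIE" as equal; which named locales are available beyond the classic one depends on the platform.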

Julie

Jul 30, 2004, 11:45:42 PM
Vinayak Raghuvamshi wrote:
> As a developer, it actually helps to work in an environment that keeps
> reminding you that the whole world is not case in-sensitive.

You may want to get out of your cubicle and look around. Computers aside, look
around.

Are you confused when you read "MILK" on a carton rather than the more common
"Milk"? Presumably no.

Do you talk to others as:

"cap H - hello period cap W what's new?" Again, presumably no.

> I agree that I could not provide a good "solution" to the original
> poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
> apPlIcAtIoNS arE caSe sEnSItIve....

If your statement were correct, then your eXaMPle wouldn't make any sense,
strictly because of case.

Look around, most real-world situations are case insensitive.

Finally, all hyperbole aside, what is the problem with _providing_ an intrinsic
mechanism to do case-insensitive comparisons?

Thomas Maeder

Jul 31, 2004, 9:51:32 AM
Julie <ju...@nospam.com> writes:

>> Fine, but how should accented lower case letter compare to unaccented
>> uppercase ones?
>
> You tell me!

I'm not Francis, but if you ask me to tell you, then I think that the idea
that there is an (even locale specific) general correct way of treating
strings case-insensitively is wrong.


> How are accented letters compared in real life situations in a locale? Apply
> that model to the language and create specific comparators that operate on
> the current locale.

Fine. You have just contradicted yourself. :-)

In real life, "ss" and "SS" are compared in a context dependent way. You can
only do it correctly if you know the meaning of the word that they are part
of. And this is just an example.

Case-insensitivity on a character by character basis simply doesn't make any
sense in the scope of std::string.


>> The idea that there is a (natural) language universal concept of case
>> sensitivity is simplistic. For example, how should we handle the German
>> double s represented by a glyph that looks like beta.
>
> Please provide more on how case sensitivity is simplistic? Do German
> keyboards not have a shift key? or does the shift key not operate on the
> QWERTY portion of the keyboard?
>
> Case is all very well defined, per the keyboard -- use that model.

This is utter nonsense.

First, there are different varieties of German keyboards (and I think they all
are "QWERTZ", not "QWERTY"), with different behavior wrt case.

Second, the idea that an ISO Standard should be based on some keyboard layout
is really adventurous.


> Nobody is advocating changes to the current behavior, simply adding to it to
> provide support for (presumably locale-specific) case-insensitive
> comparisons.

[Repeating myself:] Locale specificity is not sufficient, understanding of
the text is required.

Ray Lischner

Jul 31, 2004, 2:45:50 PM
On Saturday 31 July 2004 09:51 am, Thomas Maeder wrote:

> Locale specificity is not sufficient, understanding of
> the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results. Or do they try to interpret the text to find only correct
matches?
--
Ray Lischner, author of C++ in a Nutshell
http://www.tempest-sw.com/cpp

Thomas Maeder

Aug 1, 2004, 7:15:41 AM
Ray Lischner <rl....@tempest-sw.com> writes:

>> Locale specificity is not sufficient, understanding of
>> the text is required.
>
> I'm curious. How does word processing software perform a
> case-insensitive search in German? I guess they detect incorrect
> matches, and it is up to the user to decide what to do with the
> results.

FWIW, I just created a Word document, entered "Maße", set the language of the
entire document to "German (Germany)" and did a (what Word calls)
case-insensitive search for "MASSE" (and "MASZE", to err on the safe side);
Word didn't find it. Same result for "MASSE" in the text and "Maße" in
the search argument.


> Or do they try to interpret the text to find only correct matches?

I wouldn't know of any software that could do this anywhere near correctly.

Julie

Aug 2, 2004, 11:14:51 AM
Thomas Maeder wrote:
>
> Ray Lischner <rl....@tempest-sw.com> writes:
>
> >> Locale specificity is not sufficient, understanding of
> >> the text is required.
> >
> > I'm curious. How does word processing software perform a
> > case-insensitive search in German? I guess they detect incorrect
> > matches, and it is up to the user to decide what to do with the
> > results.
>
> FWIW, I just created a Word document, entered "Maße", set the language of the
> entire document to "German (Germany)" and did a (what Word calls)
> case-insensitive search for "MASSE" (and "MASZE", to err on the safe side);
> Word didn't find it. Same result for "MASSE" in the text and "Maße" in
> the search argument.

What happens if you enter in "Maße" and then Format/Change Case/lowercase?

Thomas Maeder

Aug 2, 2004, 6:48:47 PM

[I have the feeling that this is getting off-topic.]

Julie <ju...@nospam.com> writes:

> Thomas Maeder wrote:
>>
> What happens if you enter in "Maße" and then Format/Change Case/lowercase?

MAßE, which seems very wrong to me.

Allan W

Aug 2, 2004, 6:53:00 PM
Julie <ju...@nospam.com> wrote

> Absolutely. A local-specific case-insensitive comparator may be far from
> trivial to implement. In cases where it can't be implemented due to
> locale-specific context issues, then that comparator is simply not
> available. In those locales where it can be implemented, then it is
> provided. I don't consider this half-baked, simply providing what
> _can_ be provided, rather than an 'all or nothing' approach.

I hope the problems with this approach are apparent.

If the standard says that such a comparator *MAY* be made available
by an implementation, this implies that it might *NOT* be available.
Which means that your program can't assume that it exists on all
compliant platforms. Which means that your portable program can't
use it.

The workaround would be to have the standard specify a preprocessor
symbol that says if the comparator is available or not. Then your
program could use the library version if it is available, otherwise
it could roll its own...

But if you're able to roll your own for the cases where it's needed,
why can't you just roll your own 100% of the time? It's actually
LESS work to do this (because you don't have to muck around with
preprocessor directives).

John Dibling

Aug 3, 2004, 7:22:14 AM
sgga...@gmail.com (Ganesh) wrote in message news:<619f36eb.04072...@posting.google.com>...
> It is a surprise to most of the "common" C++ programmers to learn that
> std::string provides no simple way of doing case-insensitive
> comparison.

In fact, when you really sit down and look at the standard library as
a whole, you will find that there is a great deal that is "missing."
You are right, there is no built-in way to do a CI compare of
std::strings. But there is also no std::string version of sprintf(),
and I would argue that of all the string-related functions in the CRT,
sprintf() is (one of) the most-commonly used.
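
For simple cases, std::ostringstream covers part of what sprintf() is used for. A minimal sketch, with to_text as a hypothetical helper name; it does not attempt printf-style format strings:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats any streamable value into a std::string
// using the value's operator<<.
template <typename T>
std::string to_text(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

A call such as "views: " + to_text(22) then builds the string without a fixed-size buffer, at the cost of losing sprintf()'s compact format syntax.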

The library of "missing" functions goes far beyond sprintf(), and
even beyond string-related functions. For example, find() is provided
to find an element which compares equal to another element using
operator==. If operator== doesn't work for you, you can define what
it means to "be the same" yourself in a functor, and use find_if()
instead of find(). That is, there are non-predicated and predicated
versions of find(). But there is no predicated version of copy(),
transform() or for_each(). It didn't occur to me for a long time that
there might be predicated versions of these algorithms. But when I
did realize it, and wrote them all myself in an STL extensions
library, they became invaluable.

There is also no copy_backward_if(), and even if there were, there is
also no bidirectional_back_insert_iterator to use with it. The list
goes on...

BTW - Scott Meyers covers CI compares of std::strings in "Effective
STL," item 35.

- John Dibling
jdib...@yahoo.com

llewelly

Aug 3, 2004, 7:37:32 AM
Thomas Maeder <mae...@glue.ch> writes:

> Ray Lischner <rl....@tempest-sw.com> writes:
>
> >> Locale specificity is not sufficient, understanding of
> >> the text is required.
> >
> > I'm curious. How does word processing software perform a
> > case-insensitive search in German? I guess they detect incorrect
> > matches, and it is up to the user to decide what to do with the
> > results.
>
> FWIW, I just created a Word document, entered "Maße", set the language of the
> entire document to "German (Germany)" and did a (what Word calls)
> case-insensitive search for "MASSE" (and "MASZE", to err on the safe side);
> Word didn't find it. Same result for "MASSE" in the text and "Maße" in
> the search argument.

My question is: Do you think a German-speaker who was an ordinary
computer user, would find this behavior an unpleasant surprise?

Vinayak Raghuvamshi

Aug 3, 2004, 7:38:22 AM
Julie <ju...@nospam.com> wrote in message news:<410A91E3...@nospam.com>...

> You may want to get out of your cubicle and look around. Computers aside, look
> around.

I just did. And I did not find any std::strings "out there.." :-)

> Are you confused when you read "MILK" on a carton rather than the more common
> "Milk"? Presumably no.

Well no. But I do consider it a bit odd when I see a sentence typed as
plEase dRink mILk, rELax aND gEt A lIFE.....

As someone rightly said in a reply to your other comments, case
sensitiveness or insensitiveness depends on the context.

> Do you talk to others as:
> "cap H - hello period cap W what's new?" Again, presumably no.

Well no. And I do not use a std::string to "talk" to others. I am sure
you write "cap H - hello period cap W what's new?", though...

>
> > I agree that I could not provide a good "solution" to the original
> > poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
> > apPlIcAtIoNS arE caSe sEnSItIve....
>
> If your statement were correct, then your eXaMPle wouldn't make any sense,
> strictly because of case.

My example was meant to emphasize that case DOES make sense even for
normal, everyday sentences. It was also an effort at some humor....
:-)

> Finally, all hyperbole aside, what is the problem with _providing_ an intrinsic
> mechanism to do case-insensitive comparisons?

I never said that the STL should not provide one. But I just don't see
anything outrageous about the fact that it doesn't. The STL provides a core
set of features that can be infinitely expanded. There are libraries
like Boost that are built around and over the STL that you can use to get
these features if you do not want to build them on your own....

I just don't see any reason to get emotional about the fact that the STL
does not provide case-insensitive strings. The world is case sensitive
or insensitive depending on where you look. Again, as someone rightly
pointed out, the prime factor is the context...

Anyways, I guess we have beaten the problem to death and we could as
well have implemented a case insensitive string compare by providing
our own char traits in a fraction of the time that we spent typing out
all these case sensitive messages.. :-)

Relax, and Peace....

-Vinayak

Julie

Aug 4, 2004, 6:03:55 AM
Allan W wrote:
>
> Julie <ju...@nospam.com> wrote
> > Absolutely. A local-specific case-insensitive comparator may be far from
> > trivial to implement. In cases where it can't be implemented due to
> > locale-specific context issues, then that comparator is simply not
> > available. In those locales where it can be implemented, then it is
> > provided. I don't consider this half-baked, simply providing what
> > _can_ be provided, rather than an 'all or nothing' approach.
>
> I hope the problems with this approach are apparent.
<snip>

Well, to be honest, none of this discussion relating to case
conversion/comparison for some languages is all that clear. The explanations
have been weak and far from enlightening, and my character translation
experience is pretty much limited to ASCII where upper/lower case is well
defined as far as I'm concerned.

Presumably there are more than just a few out there that need case insensitive
comparisons? What do they do?

- Write their own std::string comparator?

- Use a platform/compiler specific case-insensitive string comparator function
such as stricmp?

- Use some third-party library/Boost?

- ???

Gerhard Menzl

Aug 4, 2004, 8:40:50 AM
llewelly wrote:

> > FWIW, I just created a Word document, entered "Maße", set the language of the
> > entire document to "German (Germany)" and did a (what Word calls)
> > case-insensitive search for "MASSE" (and "MASZE", to err on the safe side);
> > Word didn't find it. Same result for "MASSE" in the text and "Maße" in
> > the search argument.
>
> My question is: Do you think a German-speaker who was an ordinary
> computer user, would find this behavior an unpleasant surprise?

It depends. Google finds "Masse" and "Maße", no matter which of the two
you type in. This is a double-edged sword, but at least Google is aware
of the transformation. In my experience, the benefit of not missing hits
is greater than the drawback of getting false positives.

--
Gerhard Menzl

Humans may reply by replacing the obviously faked part of my e-mail
address with "kapsch".

Thomas Maeder

Aug 4, 2004, 8:49:17 AM
llewelly <llewe...@xmission.dot.com> writes:

> > FWIW, I just created a Word document, entered "Maße", set the language
> > of the entire document to "German (Germany)" and did a (what Word calls)
> > case-insensitive search for "MASSE" (and "MASZE", to err on the safe
> > side); Word didn't find it. Same result for "MASSE" in the text and
> > "Maße" in the search argument.


>
> My question is: Do you think a German-speaker who was an ordinary
> computer user, would find this behavior an unpleasant surprise?

No. The problem can't be correctly solved, so I'm not surprised by anything
here.

What I am surprised by is that the moderators let all this happen in this
newsgroup. :-)

Thomas Maeder

Aug 4, 2004, 9:57:19 AM
Niklas Matthies <usenet...@nmhq.net> writes:

> While not a word processor, my online banking application converts
> "Maße" to "MASSE" in the "reason for transfer" field, and the search
> function also correctly finds the transfer containing "MASSE" when
> searching for "maße". (Actually I tested this with "Straße".)
> Google also finds "ss"/"ae"/etc. when searching for "ß"/"ä"/etc.
> and vice versa, as well as http://dict.leo.org/ and many other
> dictionaries.

Converting "Maße" to "MASSE" is ok, but converting the other way round
is presumptuous, as is telling that the two mean the same thing.

Unless you are in a well-defined context that is a small subset of the
context of a language, or a locale. Such as, as you tell me, your
application, or, as another example, Internet host names.


"Straße" is a different case because there is no word "Strasse" in
German German.


[And I don't see how "Maße" can be a reason for transfer, but that may
be me :-)]


> I would say that this is pretty much expected behavior from German-
> aware software, despite resulting in "incorrect" matches. IMHO such
> matches are in the same category as those you would get with any
> homographic word (e.g. "record").

If you can live with false positives, that's ok.

But functionality that delivers false positives should not be
standardized.

ka...@gabi-soft.fr

unread,
Aug 4, 2004, 10:04:51 AM8/4/04
to
llewelly <llewe...@xmission.dot.com> wrote in message
news:<86wu0h5...@Zorthluthik.local.bar>...
> Thomas Maeder <mae...@glue.ch> writes:

>> Ray Lischner <rl....@tempest-sw.com> writes:

>>>> Locale specificity is not sufficient, understanding of the text
>>>> is required.

>>> I'm curious. How does word processing software perform a
>>> case-insensitive search in German? I guess they detect incorrect
>>> matches, and it is up to the user to decide what to do with the
>>> results.

>> FWIW, I just created a Word document, entered "Maße", set the
>> language of the entire document to "German (Germany)" and did a
>> (what Word calls) case-insensitive search for "MASSE" (and "MASZE",
>> to err on the safe side); Word didn't find it. Same result for
>> "MASSE" in the text and "Maße" in the search argument.

> My question is: Do you think a German-speaker who was an ordinary
> computer user, would find this behavior an unpleasant surprise?

Perhaps:-). This is getting a bit away from C++ (and I'm probably not a
typical computer user, so take my comments with a bit of salt), but...

There are two contexts where the case issue comes up when dealing in
normal text: converting, and searching. Given the word "Maße" in normal
text, I would expect converting it to caps to give "MASSE"; if a
program claims to support case conversion, and doesn't do this, it is,
IMHO, broken. I don't expect the reverse to be true -- it may be
because I am computer aware, and realize the limitations, but I wouldn't
expect a program, asked to convert "MASSE" to lower case, to be able to
tell whether the results should be "Maße" or "Masse"; for that matter, I
would be very impressed if the program realized that it was dealing with
a word which, even in lower case, must start with a capital letter.
Similarly, it would never occur to me to do a case insensitive search
for "MASSE"; I would expect, however, that a case insensitive search for
"maße" or "Maße" match MASSE.

The C++ library has all of the necessary functions for most reasonable
uses, see the std::collate facet, or std::locale::operator(), for
example. Logically, they ARE part of the locale section of the library,
since they very much depend on the locale. Regretfully (although I
don't see any reasonable alternative), the standard doesn't require any
locales except "C" to be present, and text is case sensitive in the C
locale, so you have no guarantee of being able to do a case insensitive
comparison. IMHO, it wouldn't be too much for the standard to require
at least one language/country specific locale to be furnished, although
in the absence of a standard for naming such locales, I'm not sure how
much this would help. From a quality of implementation point of view,
I think a minimum would include an international locale (based on
English, since that is the international language) and a locale for the
country in which the compiler is being sold -- for countries like
Belgium, Canada and Switzerland, this means in fact several locales.

I also see a need for OS specific locales, e.g. "POSIX" or "WINDOWS".
(The Posix standard requires it for Posix conformant systems.) Thus, in
"POSIX", the collate facet is case sensitive; in "WINDOWS" it is not. Here, too,
it would seem acceptable, at least to me, that the standard require such
a locale; possibly even that it give it a fixed name (e.g. "SYSTEM").

As a passing thought, I wonder what rules Windows uses for its case
insensitive filename comparison. In French, for example, 'i' == 'I',
but this would definitely not be the case in Turkish, where you should
have 'i' == '\u0130' and '\u0131' == 'I'. I suppose that the obvious
solution is just to ignore all accents, with 'i' == 'I' == '\u0130' ==
'\u0131', but this will lead to ambiguous names in Turkish, and probably
some other languages as well. And of course, "Maße", "MASSE" and
"MASZE" must compare equal as well, or the system will be quite
counter-intuitive in most German speaking areas (but not Switzerland).

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Niklas Matthies

unread,
Aug 4, 2004, 11:19:49 PM8/4/04
to
On 2004-08-04 13:57, Thomas Maeder wrote:
> Niklas Matthies <usenet...@nmhq.net> writes:
>
>> While not a word processor, my online banking application converts
>> "Maße" to "MASSE" in the "reason for transfer" field, and the search
>> function also correctly finds the transfer containing "MASSE" when
>> searching for "maße". (Actually I tested this with "Straße".)
>> Google also finds "ss"/"ae"/etc. when searching for "ß"/"ä"/etc.
>> and vice versa, as well as http://dict.leo.org/ and many other
>> dictionaries.
>
> Converting "Maße" to "MASSE" is ok, but converting the other way round
> is presumptuous, as is telling that the two mean the same thing.

But subsequent application of the search function does exactly that.
When searching for "Masse", the search function cannot tell whether
you mean to include the "Masse" transcription of "Maße" or not.

Or, incidentally, when searching for "Thomas Maeder" it cannot tell
whether you mean to (also) get matches for "Thomas Mäder" or not.

German-language strings like "Maeder" are inherently ambiguous in
general, since you can't tell whether this may mean "Mäder" because
of restricted input capabilities (say a US keyboard without an input
method for non-US characters) or a restricted character set (as is the
case with bank transfers), or whether this is really meant to be
"Maeder" and only "Maeder".

:


>> I would say that this is pretty much expected behavior from German-
>> aware software, despite resulting in "incorrect" matches. IMHO such
>> matches are in the same category as those you would get with any
>> homographic word (e.g. "record").
>
> If you can live with false positives, that's ok.
>
> But functionality that delivers false positives should not be
> standardized.

My point is that searching for "record" (case-insensitive or not) can
also result in such semantic false positives.

-- Niklas Matthies

ka...@gabi-soft.fr

unread,
Aug 4, 2004, 11:22:45 PM8/4/04
to
vs_ragh...@hotmail.com (Vinayak Raghuvamshi) wrote in message
news:<9afa978c.04072...@posting.google.com>...
> sgga...@gmail.com (Ganesh) wrote in message
> news:<619f36eb.04072...@posting.google.com>...

> > It is a surprise to most of the "common" C++ programmers to learn
> > that std::string provides no simple way of doing case-insensitive
> > comparison.

> Well, isn't everything case sensitive in C++? so why surprised at
> strings being treated in case sensitive manner? :-)

I appreciate the smiley.

> STL is kind of saying "hey, strings and everything else are case
> sensitive in C++, but you can replace any of my methods with your own
> in a pluggable manner...". I think it is fair enough...

> Simple way of doing case-insensitive comparison?

> stricmp(dest.c_str(),src.c_str());

Which just moves the problem. Now you have to write a function stricmp.

What's wrong with something like:

    std::map< std::string, MyClass, std::locale >
        myMap( std::locale( "de_DE" ) ) ;

? Or whatever, according to what you are doing. For a simple
comparison,

    if ( std::use_facet< std::collate< char > >( std::locale() )
             .compare( s1.data(), s1.data() + s1.size(),
                       s2.data(), s2.data() + s2.size() ) == 0 ) ...

should do the trick. Although one does wonder why the interface uses
char const*, and not std::string. I'd definitely consider wrapping this
one in a global function.
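
A wrapper of the kind mentioned here might look like the following minimal sketch. The name collate_equal is hypothetical, not something from the standard library, and whether two strings that differ only in case compare equal depends entirely on the collate facet of the locale passed in. In the classic "C" locale the comparison is case sensitive.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: compares two strings using the collate facet of loc
// and returns true when that facet considers them equal.
inline bool collate_equal(const std::string& s1,
                          const std::string& s2,
                          const std::locale& loc = std::locale())
{
    const std::collate<char>& coll =
        std::use_facet<std::collate<char> >(loc);
    return coll.compare(s1.data(), s1.data() + s1.size(),
                        s2.data(), s2.data() + s2.size()) == 0;
}
```

A call then reads collate_equal(s1, s2), or collate_equal(s1, s2, someLocale) when a specific locale is wanted.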

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

ka...@gabi-soft.fr

unread,
Aug 4, 2004, 11:23:35 PM8/4/04
to
all...@my-dejanews.com (Allan W) wrote in message
news:<7f2735a5.04080...@posting.google.com>...

> Julie <ju...@nospam.com> wrote
> > Absolutely. A local-specific case-insensitive comparator may be far
> > from trivial to implement. In cases where it can't be implemented
> > due to locale-specific context issues, then that comparator is
> > simply not available. In those locales where it can be implemented,
> > then it is provided. I don't consider this half-baked, simply
> > providing what _can_ be provided, rather than an 'all or nothing'
> > approach.

> I hope the problems with this approach are apparent.

They are:-).

> If the standard says that such a comparator *MAY* be made available by
> an implementation, this implies that it might *NOT* be available.
> Which means that your program can't assume that it exists on all
> compliant platforms. Which means that your portable program can't use
> it.

I agree, but that IS what the standard says. Furthermore, it says that
if it is available, you don't know the name of it, and if you try and
use it, and it isn't available, or you get the wrong name, you get a
run-time exception (and not a compiler error).

Personally, I find it an awkward situation, and it has really caused me
problems. (It caused even more problems because one compiler wasn't
conformant -- if the service wasn't available, it just did something else,
rather than tell me.)

Anyway, with the correct locales installed under Solaris, something
like:
    std::sort( v1, v2, std::locale( "de_DE" ) ) ;
should work. (I can't test it, because someone removed all of the
locales on my machine.) The problem is, although exactly the same
functionality is available under Windows (again, perhaps dependent on
the installation of some particular software), the string constant is
probably different -- worse, I have no idea what it should be.
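
Since an unknown locale name is only detected at run time, code that sorts this way probably wants a fallback. The following is a minimal sketch under that assumption; the helper names locale_or_classic and sort_with_locale are made up for illustration, and falling back to the classic "C" locale is just one possible policy.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Tries to construct the named locale. If the name is not known on the
// machine where the program runs, std::locale's constructor throws
// std::runtime_error, and this sketch falls back to the classic "C" locale.
inline std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Sorts the strings by the collation rules of loc. std::locale::operator()
// compares two strings through the collate facet, so a locale object can be
// passed directly as the comparison predicate.
inline void sort_with_locale(std::vector<std::string>& v,
                             const std::locale& loc)
{
    std::sort(v.begin(), v.end(), loc);
}
```

With this, sort_with_locale(names, locale_or_classic("de_DE")) uses German collation where that locale is installed and plain character-code order elsewhere.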

> The workaround would be to have the standard specify a preprocessor
> symbol that says if the comparator is available or not.

The problem is that, at least in the Unix world (but I think that the
situation is similar under Windows), whether the functionality is
available depends on what is or is not installed on the machine where
the code is run, and not on the machine where it is compiled. Ideally,
supposing the Posix naming convention, one would like to see all
combinations of language and country available; practically, the demand
for something like "eu_AL" (Basque, as used in Albania) is small enough
that I'm sure it will never be supported. And how could an
implementation pretend to support "zh_CN" (Chinese) for std::string?

In sum, the current situation is totally unacceptable, but I'm not sure
what is both acceptable and reasonably possible. So until I can propose
a workable alternative, I'm living with it.

> Then your program could use the library version if it is available,
> otherwise it could roll it's own...

> But if you're able to roll your own for the cases where it's needed,
> why can't you just roll your own 100% of the time? It's actually LESS
> work to do this (because you don't have to muck around with
> preprocessor directives).

The problem is that most programmers can't roll their own. The whole
point (well, one major point) of having locales is that the programmer
doesn't know all of the rules for all of the locales which he will have
to support.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Andrea Griffini

unread,
Aug 5, 2004, 6:53:40 AM8/5/04
to
On 3 Aug 2004 07:38:22 -0400, vs_ragh...@hotmail.com (Vinayak
Raghuvamshi) wrote:

>I never said that stl should not provide one. But I just dont see
>anything outrageous about the fact that it doesnt. stl provides a core
>set of features that can be infinitely expanded. there are libraries
>like Boost that are built around and over stl that you can use to get
>these features if you do not want to build them on your own....

This thread really reminds me of the one about the ability
to trim trailing or leading spaces from a string.
No. The standard library is not providing that "exotic"
feature either and you must code one yourself.

If you go looking back to that thread you'll find a lot
of explanation about why removing trailing spaces from
a string is:

- pointless
- not well defined
- locale dependent
- not the job of std::string
- immoral
- uncool


Now do this experiment...

Imagine asking for a glass of water. And imagine
that the bartender starts discussing ad infinitum
what exactly "a glass" means (it's clear that
in various countries the average glass size is quite
different, by several percentage points!!... and don't
expect that big/small will be enough to get out of
that) and exactly what you mean by water (and this
can't be simply gas/no gas... because you surely
know that there are a jillion different types of
water that do not taste exactly the same).
Hey!!... maybe it's easier if you fill in a form
about what kind of water you're looking for, giving
the chemical properties and the temperature you would
like it to be (and of course this can't be just
"cold" or not... as it's clear that "cold" is both
subjective and context dependent).

Now let me guess what would be your reaction...

Probably the reaction would be just leaving the pub
babbling something like "geesh, you're crazy" or,
if you have a gun and are really really thirsty,
stuffing your gun up the nose of the bartender and
saying in a warm calm voice "now I'll count to ten...".


In my opinion a newbie reading this discussion about
converting a string to uppercase or removing trailing
spaces from a string will have the strong temptation
to just leave the language. And who can blame him
or her for that? In my opinion common sense left
this dark area of C++ a long time ago.

Andrea

Alf P. Steinbach

unread,
Aug 5, 2004, 10:15:13 AM8/5/04
to
* Andrea Griffini:

Applause!

But also, there is a difference in that the standard library is
more like the organization that provides tap water to the city,
and exact standards must be defined and guaranteed.

Common sense is to choose a sensible, practical set of standards
and focus on the guarantee/delivery bit; but as you've noted
discussions tend to instead focus on choosing the most impractical
and unusable but in some academic sense "perfect" set of standards
while using the fact that such perfection cannot be guaranteed or
even generally achieved as argument to not provide anything at all.

For what it's worth, I think the practical set of standards should
be character code oriented (forget about locales and all that stuff),
which is essentially what Julie suggested before getting bogged down
in demands for definitions of "glass", "water", "temperature" etc.

If the character code provides a unique uppercase character, then
that's it (regardless of idiosyncrasies of English, German or for
that matter Norwegian); otherwise, leave the character as-is. This
means that tolower(toupper(s)) == tolower(s) does not hold in general.
And that's very very very OK, because that's how it Really Is (TM).
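
A character-code-oriented upper case of this kind can be sketched in a few lines. This is only an illustration: the names upper_char and upper are hypothetical (upper(s) is the spelling used elsewhere in this thread), and std::toupper follows the current C locale, so in the default "C" locale only 'a' to 'z' are changed and every other character is left as-is.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Upper-cases one character by its code. Characters without an uppercase
// counterpart in the current C locale are returned unchanged.
inline char upper_char(char c)
{
    // The cast avoids undefined behaviour for negative char values.
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical upper(s): applies upper_char to every character of a copy.
inline std::string upper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), upper_char);
    return s;
}
```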

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

ka...@gabi-soft.fr

unread,
Aug 5, 2004, 10:16:51 AM8/5/04
to
Julie <ju...@nospam.com> wrote in message
news:<410F1DD9...@nospam.com>...

> Allan W wrote:
> > Julie <ju...@nospam.com> wrote
> > > Absolutely. A local-specific case-insensitive comparator may be
> > > far from trivial to implement. In cases where it can't be
> > > implemented due to locale-specific context issues, then that
> > > comparator is simply not available. In those locales where it can
> > > be implemented, then it is provided. I don't consider this
> > > half-baked, simply providing what _can_ be provided, rather than
> > > an 'all or nothing' approach.

> > I hope the problems with this approach are apparent.
> <snip>

> Well, to be honest, none of this discussion relating to case
> conversion/comparison for some languages is all that clear. The
> explanations have been weak and far from enlightening, and my
> character translation experience is pretty much limited to ASCII where
> upper/lower case is well defined as far as I'm concerned.

The explanations have largely been based on examples, I think. There's
no real theory behind it -- natural language conventions don't follow
rigorous mathematical rules which can be logically explained. The only
important thing to note is that case conversion is not necessarily a
bijection, and that case insensitive comparison isn't a well defined
operation.

> Presumably there are more than just a few out there that need case
> insensitive comparisons? What do they do?

> - Write their own std::string comparator?

> - Use a platform/compiler specific case-insensitive string
> comparator function such as stricmp?

> - Use some third-party library/Boost?

> - ???

The first thing I always do is define what I want. I think the main
point of many of us posting here is that the expression "case
insensitive comparison" is not an adequate specification to begin
anything; it leaves a lot of questions unanswered. So the first thing
is to actually define what the application needs. The needs of a Pascal
compiler (which uses a very limited set of input characters) are
different from those of a database of German book titles. Once I define
what is actually needed, I then see if anything existing will do the
job. If it will, I use it. If it won't, I write what is needed.

For more information, you might want to look at some of the Unicode
technical reports (http://www.unicode.org/unicode/reports/index.html);
UTS 10 (http://www.unicode.org/unicode/reports/tr10/) is particularly
relevant. In fact, if your concerns are collating or comparing text in
a natural language (including English), I would consider it necessary
reading -- even in English, you would want "naïve" == "NAIVE".

For artificial languages (e.g. Pascal, SQL, Windows filenames), the
problem is usually much simpler, and a simple one to one mapping of
lower case characters to upper case characters is often sufficient.
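
For the artificial-language case, a one-to-one mapping can be packaged as a comparator. The sketch below assumes an ASCII-compatible encoding in which 'a' to 'z' are contiguous; the names ascii_upper, AsciiCaseInsensitiveLess and KeywordTable are invented for illustration.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Maps 'a'..'z' to 'A'..'Z'; every other character is left untouched.
// Assumes an ASCII-compatible encoding.
inline char ascii_upper(char c)
{
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

// Strict weak ordering that ignores the case of ASCII letters only.
struct AsciiCaseInsensitiveLess
{
    static bool char_less(char a, char b)
    {
        return ascii_upper(a) < ascii_upper(b);
    }

    bool operator()(const std::string& lhs, const std::string& rhs) const
    {
        return std::lexicographical_compare(lhs.begin(), lhs.end(),
                                            rhs.begin(), rhs.end(),
                                            char_less);
    }
};

// Example use: a keyword table for a case-insensitive language.
typedef std::map<std::string, int, AsciiCaseInsensitiveLess> KeywordTable;
```

With such a table, an entry stored under "BEGIN" is also found under "begin" or "Begin", which is about all a Pascal-style scanner needs.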

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

ka...@gabi-soft.fr

unread,
Aug 6, 2004, 11:13:49 AM8/6/04
to
Andrea Griffini <agr...@tin.it> wrote in message
news:<2pi3h0lsn1pp6ttoj...@4ax.com>...

> Now do this experiment...

Now that's an interesting example. Because in France, at least, if you
ask for water in a restaurant, the first thing the waiter is likely to
do is ask you what kind. I don't see how it could be otherwise, since
both sparkling and flat water are widespread. In Germany or in Italy, he
will automatically bring you a bottle of the house brand mineral water
(always with gas). Whereas in America, it will be tap water with lots
of ice.

In sum, no reasonable person would expect a simple solution for an
incomplete question.

> In my opinion a newbie reading this discussion about converting a
> string to uppercase or removing trailing spaces from a string will
> have the strong temptation to just leave the language. And who can
> blame him or her for that ? In my opinion the common sense left this
> dark area of C++ long time ago.

Is it a lack of common sense to want to know what the function should do
before trying to find it? The C++ standard DOES have a function for
case insensitive comparison of strings: std::collate::compare.
Obviously, it's a template function (since it has to deal with char and
wchar_t), obviously, it is in the locale section (since, like water,
what one intuitively expects from "case" depends on local conventions).
And just as obviously, the user can supply additional versions for
himself, since this is definitely a case where one size doesn't fit all.

(Removing trailing spaces is a different issue -- it is locale
dependent, but other than that, I don't see any real problems.)

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Andrea Griffini

unread,
Aug 7, 2004, 9:45:39 PM8/7/04
to
On 6 Aug 2004 11:13:49 -0400, ka...@gabi-soft.fr wrote:

>In sum, no reasonable person would expect a simple solution for an
>incomplete question.

Every question is incomplete; it all boils down to where you want
to stop adding details. Common sense is trying to stop asking for
details at least before the person talking to you is tempted to
punch you in the nose.

If someone asks you "where is north", I wonder if you are
going to say "do you mean geographic north, magnetic north
or geomagnetic north?". You can really go on forever... even
after establishing which of the three major north poles one is
concerned with, the question is far from being "complete".
For example, supposing the subject is magnetic north, a
question could be "Are you asking what direction a compass would
be pointing in (hence considering local magnetic field modifications),
or what geodesic line would pass through here and the north magnetic
pole, supposing the earth to be a sphere (so you're interested
in where the pole is)?".

But my guess is that your nose would be already bleeding by then.

>Is it a lack of common sense to want to know what the function should do
>before trying to find it?

Lack of common sense is the absence of an "s.upper()" or "upper(s)"
that works on std::string by default. It would of course have been
OK to be able to handle the complexity needed for Chinese ... but
ONLY if that wasn't going to get in the way where it's not needed.

To me it's evident (and Francis confirmed) that the problem is
the "committee effect", which required avoiding the assumption that
American English should be the "default". Or that anything
was going to be the default, because that would have been
"unfair" to the others.

The situation closely reminds me of the TIFF file format
situation... where, because it would have been "unfair" to
choose between big-endian and little-endian, the totally nonsensical
solution is that the file starts with a two-byte mark that tells whether
the rest will use the little-endian or big-endian representation.
With the net result that now BOTH little-endian and big-endian
architectures have added complexity when reading those files,
and that writing portable code handling TIFF files is harder
because you'll have BOTH the compile-time endianness problem
AND the run-time endianness problem.
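
To make the run-time part concrete: a TIFF file begins with a two-byte order mark, "II" for little-endian or "MM" for big-endian, and every later multi-byte value has to be decoded accordingly. The following is a minimal sketch; the names ByteOrder, tiff_byte_order and read_u16 are hypothetical and it handles nothing beyond 16-bit values.

```cpp
#include <cstddef>
#include <stdexcept>

enum ByteOrder { LittleEndian, BigEndian };

// Reads the two-byte order mark at the start of a TIFF header:
// 0x49 0x49 ("II") means little-endian, 0x4D 0x4D ("MM") means big-endian.
inline ByteOrder tiff_byte_order(const unsigned char* header, std::size_t size)
{
    if (size >= 2 && header[0] == 0x49 && header[1] == 0x49)
        return LittleEndian;
    if (size >= 2 && header[0] == 0x4D && header[1] == 0x4D)
        return BigEndian;
    throw std::runtime_error("not a TIFF byte-order mark");
}

// Decodes a 16-bit value byte by byte, so the result does not depend on
// the endianness of the machine running the code.
inline unsigned read_u16(const unsigned char* p, ByteOrder order)
{
    const unsigned b0 = p[0];
    const unsigned b1 = p[1];
    return order == LittleEndian ? (b0 | (b1 << 8)) : ((b0 << 8) | b1);
}
```

Every reader ends up carrying the order value around and branching on it for each field, which is the added complexity complained about above.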

IMO drawing straws would have been a better solution. By far.

>The C++ standard DOES have a function for case insensitive
>comparison of strings: std::collate::compare.

But no s.upper() or upper(s) ... because that would be

- pointless
- not well defined
- locale dependent

- immoral
- uncool
- having it work for American English would be unfair
  to languages where it's an unsolvable problem (IIUC
  for German not even a dictionary would be enough... but
  a syntactic analysis or even a semantic analysis of the
  meaning of the text is required).

>And just as obviously, the user can supply additional versions for
>himself, since this is definitly a case where one size doesn't fit all.

I don't need it solved in the general case. I can solve it
to any extent I want if I have to. And I'm not forced to put
my solution in the framework of the standard library.

Let me add that I probably wouldn't. Reading items 2 and 3 of Herb
Sutter's Exceptional C++ made it clear to me that I'll
stay as far as possible from that. My job is solving problems
using C++ as a tool, not fighting with C++ for the fun of it.
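
For readers who have not seen it, the technique those items discuss is a string type with custom character traits. What follows is a rough sketch from memory of that general idea, not a reproduction of the book's code; the names ci_char_traits and ci_string follow common usage, and the helper up is an assumption of this sketch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Character traits that compare characters after upper-casing them.
struct ci_char_traits : public std::char_traits<char>
{
    static char up(char c)
    {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }

    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) < up(b); }

    static int compare(const char* s1, const char* s2, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char a)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], a)) return s + i;
        return 0;
    }
};

// A string whose comparisons ignore case, as far as std::toupper does.
typedef std::basic_string<char, ci_char_traits> ci_string;
```

The catch is that ci_string is a different type from std::string, so it does not convert to or from std::string implicitly and does not work with the std::ostream inserters without extra code.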

Lack of common sense is providing complex solutions (or
complex infrastructure where you should put your complex
solution) for complex cases, while neglecting to provide
reasonable simple solutions for simple cases.

>(Removing trailing spaces is a different issue -- it is locale
>dependant, but other than that, I don't see any real problems.)

But where are the trim functions in the standard library ?

Anyway I don't think that anything I may say would convince
you that there's lack of common sense in what C++ proposes.
If you can't see why the following is ludicrous

    if ( std::use_facet< std::collate< char > >( std::locale() )
             .compare( s1.data(), s1.data() + s1.size(),
                       s2.data(), s2.data() + s2.size() ) == 0 ) ...

probably no amount of explanation would be enough.


Andrea

Eugene Gershnik

unread,
Aug 7, 2004, 9:59:28 PM8/7/04
to
ka...@gabi-soft.fr wrote:
> The C++ standard DOES have a function
> for case insensitive comparison of strings: std::collate::compare.

Let's compare it to other languages/dialects for a simple task: condition
based on an internet protocol name which is always English and locale
independent. I am perfectly aware that the snippets below are not equivalent
but this is beside the point, which is how much work a simple and
frequent task should take.

(Disclaimer: Code typed without compiling)

<popular language #1>

String protocol = "HTTP";

if (protocol.compareToIgnoreCase("http") == 0)
{
    ...
}

</popular language #1>

<popular language #2>

string protocol = "HTTP";

if (String.Compare(protocol, "http", true) == 0)
{
    ...
}

</popular language #2>

<what C++ programmers usually do>

const string protocol = "HTTP";

if (_stricmp(protocol.c_str(), "http") == 0)
{
    ...
}

</what C++ programmers usually do>

<standard C++>

const string protocol = "HTTP";

const char * const protocol_begin = protocol.c_str();
const char * const protocol_end = protocol_begin + protocol.length();
const char test_begin[] = "http";
const char * const test_end = test_begin + sizeof(test_begin) - 1;
if (use_facet<collate<char> >(locale::classic()).compare(
        protocol_begin,
        protocol_end,
        test_begin,
        test_end) == 0)
{
    ...
}

</standard C++>

--
Eugene

Bo Persson

unread,
Aug 8, 2004, 5:50:09 PM8/8/04
to

"Andrea Griffini" <agr...@tin.it> skrev i meddelandet
news:n509h0tmoel596ouj...@4ax.com...

> On 6 Aug 2004 11:13:49 -0400, ka...@gabi-soft.fr wrote:
>
> >Is it a lack of common sense to want to know what the function should
> >do before trying to find it?
>
> Lack of common sense is the missing of "s.upper()" or "upper(s)"
> working on std::string by default. It would have been of course
> ok being able to handle complexity needed for chinese ... but
> ONLY if that wasn't going to annoy where it's not needed.

It would work fine for Chinese (in a way), because it doesn't even have
the concept of case.

>
> To me it's evident (and Francis confirmed) that the prolem is
> the "committee effect" that required to avoid assuming that
> american english should be the "default". Or that anything
> was going to be the default because that would have been
> "unfair" for the others.

To me at least, it seems utterly silly to have an ISO standard demand
functions that only work for US English. (Yes I know about the C
library!)


>
> >The C++ standard DOES have a function for case insensitive
> >comparison of strings: std::collate::compare.
>
> But no s.upper() or upper(s) ... because that would be
>
> - pointless

This is the closest. We already have a bunch of totally useless
character classification functions in the C library. Why add more of
those to the C++ library?

> - not well defined
> - locale dependent
> - immoral
> - uncool
> - having it working for american english would be unfair
> for languages where it's an unsolvable problem (IIUC
> for german not even a dictionary could be enough... but
> a syntax analysis or even an semantical analysis of the
> meaning of the text is required).

This is an ISO standard. Why add US-only functions to that?

Perhaps the ANSI version of the standard could add those?

Bo Persson

James Kanze

unread,
Aug 8, 2004, 6:04:20 PM8/8/04
to
Thomas Maeder <mae...@glue.ch> writes:

|> Niklas Matthies <usenet...@nmhq.net> writes:

|> > While not a word processor, my online banking application converts
|> > "Maße" to "MASSE" in the "reason for transfer" field, and the
|> > search function also correctly finds the transfer containing
|> > "MASSE" when searching for "maße". (Actually I tested this with
|> > "Straße".) Google also finds "ss"/"ae"/etc. when searching for
|> > "ß"/"ä"/etc. and vice versa, as well as http://dict.leo.org/ and
|> > many other dictionaries.

|> Converting "Maße" to "MASSE" is ok, but converting the other way
|> round is presumptuous, as is telling that the two mean the same
|> thing.

All of the applications I know are either case sensitive, or treat
everything as upper case, for historical reasons. One important
application I don't know is what Windows does with filenames, but MS-DOS
also treated them as upper case, so I suspect that this is also the
case. Thus, if you have a file named Maße.txt, and you try and create
one named MASSE.TXT, the system should refuse (or replace the original
file, depending on the context).

Similarly, for a generalized case insensitive text search, I would convert to
upper case, and treat all accented characters as being equal to the
unaccented version. There would be some false positives, but this is
generally preferable to missing something, or requiring the user to make
several searches or to use some complex regular expression to find what
he is looking for.
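
The folding described here could be sketched as a function that turns text into a search key. This is only an illustration under stated assumptions: it works on wide strings, the name search_key is made up, and the table covers just a handful of Latin letters rather than anything like a complete mapping.

```cpp
#include <cwctype>
#include <string>

// Folds text into a search key: upper case, with sharp s expanded to "SS"
// and a few accented letters mapped to their unaccented base letter.
// The table is deliberately tiny and illustrative, not complete.
inline std::wstring search_key(const std::wstring& text)
{
    std::wstring key;
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
        case L'\u00DF':                      // sharp s
            key += L"SS";
            break;
        case L'\u00E4': case L'\u00C4':      // a with diaeresis
            key += L'A';
            break;
        case L'\u00EF': case L'\u00CF':      // i with diaeresis
            key += L'I';
            break;
        case L'\u00E9': case L'\u00C9':      // e with acute
            key += L'E';
            break;
        default:                             // everything else: plain upper case
            key += static_cast<wchar_t>(
                std::towupper(static_cast<std::wint_t>(text[i])));
            break;
        }
    }
    return key;
}
```

Two pieces of text then match when their keys are equal, so "Maße" and "MASSE" both fold to "MASSE", false positives included.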

|> Unless you are in a well-defined context that is a small subset of
|> the context of a language, or a locale. Such as, as you tell me,
|> your application, or, as another example, Internet host names.

Internet domain names are a special case -- only seven bit ASCII is
allowed, so no ambiguities are possible. But unless you are writing
protocol level software (e.g. a new implementation of DNS), you should
probably not play with it.

|> "Straße" is a different case because there is no word "Strasse" in
|> German German.

But there is in Swiss German.

And to add an additional complication, when I learned German, there was
no word "dass" in German, just "daß". Today, it is the reverse.

|> [And I don't see how "Maße" can be a reason for transfer, but that
|> may be me :-)]

I suspect that he just experimented with the two lines of free text
allowed in a standard bank transfer. It doesn't have to be reasonable,
as long as both parties understand why the transfer is being made.

|> > I would say that this is pretty much expected behavior from
|> > German-aware software, despite resulting in "incorrect" matches.
|> > IMHO such matches are in the same category as those you would get
|> > with any homographic word (e.g. "record").

|> If you can live with false positives, that's ok.

|> But functionality that delivers false positives should not be
|> standardized.

The problem is what the program is being used for. IMHO:

- there is a locale specific function, std::collate<>::compare, which
  is standardized, and would seem to fit the bill, and

- it probably wouldn't be a bad idea to add a requirement for a locale
  for comparing system specific filenames -- Posix requires a locale
  named "POSIX", but as far as I know, Windows doesn't require
  anything, and there really should be a portable name that one could
  use.

For other cases of case insensitive comparison... Who knows what is
needed?

--
James Kanze


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

James Kanze

Aug 8, 2004, 6:13:53 PM
Andrea Griffini <agr...@tin.it> writes:

|> On 6 Aug 2004 11:13:49 -0400, ka...@gabi-soft.fr wrote:

|> >In sum, no reasonable person would expect a simple solution for an
|> >incomplete question.

|> Every question is incomplete; it all boils down where you want to
|> stop adding details. Common sense is trying to stop asking details
|> at least before who is talking to you is tempted to hit your nose
|> with a punch.

|> If someone asks you "where is north" now I wonder if you are going
|> to say "do you mean geographical north, magnetical north or
|> geomagnetical north ?". You can really go forever... even
|> establishing which of the three major north pole one is concerned
|> about the question is far from being "complete". For example
|> supposing the subject is the magnetical north a question could be
|> "Are you asking what direction would a compass be pointing to (hence
|> considering local magnetic field modifications), or what geodesic
|> line would pass from here and the north magnetic pole supposing the
|> earth being a sphere (so you're interested in where the pole is) ?".

|> But my guess is that your nose would be already bleeding by then.

It's not quite equivalent. Everywhere I've ever been, if someone speaks
of "north", they mean geographical north. In most of the places I've
actually worked, however, there really are ambiguities concerning case
insensitive comparisons, and e.g. 'é' and 'E' are not considered equal
when comparing, say, filenames, but are when comparing other things.

For better or worse, just saying you want a case insensitive comparison
is NOT a sufficient specification to do anything about in French or
German. It is in English, and I think in Italian as well (although even
there, one might expect stricmp( "vertù", "VERTU" ) to report a match).

And it is a real fact that a significant number of users of C++ are not
working in English speaking environments.

|> >Is it a lack of common sense to want to know what the function
|> >should do before trying to find it?

|> Lack of common sense is the missing of "s.upper()" or "upper(s)"
|> working on std::string by default. It would have been of course ok
|> being able to handle complexity needed for chinese ... but ONLY if
|> that wasn't going to annoy where it's not needed.

I would argue that something like s.upper() or toUpper(s) would be a
good idea. I would also argue, however, that the actual signature
should be something like:

std::string::upper( std::locale const& = std::locale() ) ;

I do agree that there are many contexts where it is clear. I have
nothing against reasonable defaults.
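
A minimal sketch of what such a function could look like, written as a free function rather than a member. The name toUpper is hypothetical, and the body is only a character-by-character mapping through the std::ctype<char> facet of the given locale, so it cannot turn "ß" into "SS"; that limitation is exactly what this thread is about.

```cpp
#include <locale>
#include <string>

// Hypothetical helper, not part of the standard library.
// Uses the global locale unless the caller passes another one.
std::string toUpper( std::string const& s,
                     std::locale const& loc = std::locale() )
{
    std::string result( s ) ;
    if ( ! result.empty() ) {
        std::ctype< char > const& ct
            = std::use_facet< std::ctype< char > >( loc ) ;
        // Converts each char in [begin, end) in place, one for one.
        ct.toupper( &result[ 0 ], &result[ 0 ] + result.size() ) ;
    }
    return result ;
}
```

With the classic locale, toUpper( "http", std::locale::classic() ) yields "HTTP"; what happens to non-ASCII characters depends entirely on the locale and encoding in use.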

|> To me it's evident (and Francis confirmed) that the problem is the
|> "committee effect" that required to avoid assuming that american
|> english should be the "default". Or that anything was going to be
|> the default because that would have been "unfair" for the others.

There is a political problem with a "default" of American English, at
least when the default can't be overridden. In this case, it would seem
to me that there is a good solution, which allows overriding, or even
setting the default to something else.

[...]


|> >The C++ standard DOES have a function for case insensitive
|> >comparison of strings: std::collate::compare.

|> But no s.upper() or upper(s) ... because that would be

|> - pointless
|> - not well defined
|> - locale dependent
|> - immoral
|> - uncool
|> - having it working for american english would be unfair
|> for languages where it's an unsolvable problem (IIUC
|> for german not even a dictionary could be enough... but
|> a syntax analysis or even an semantical analysis of the
|> meaning of the text is required).

More likely because despite the name, std::string really has very little
to do with text. It's just a glorified container for small integers. Or
whatever -- the standard says you can have std::basic_string<double>
(although it core dumps with g++ on Solaris).

I'll admit that I'd find even a limited toupper more use than
basic_string<double>. Precisely because of all the problems we've been
talking about -- you can't implement it using a character by character
translation, so it has to work on strings. IMHO, it must be locale
specific, but that's not really a problem.

On the other hand, it doesn't require any cool template
meta-programming, so I guess that's a good reason not to have it.

|> >And just as obviously, the user can supply additional versions for
|> >himself, since this is definitely a case where one size doesn't fit
|> >all.

|> I don't need it solved in the general case. I can solve it to any
|> extent I want if I have to. And I'm not forced to put my solution in
|> the frameset of the standard library.

|> Let me add that I probably wouldn't. Reading Herb Sutter's
|> exceptional C++ items 2 and 3 made clear for me that I'll stay as
|> far as possible from that. My job is solving problems using C++ as a
|> tool, not fighting with C++ for the fun of it.

Sounds like we have similar problems:-). My customers pay me for
working code, not for stress testing compilers.

Maybe the only difference is that I've really had to deal with "case
insensitive" look-ups involving "Maße":-). I'll admit that I'm very
sensitized to the problem. (And a quick glance at the thread shows that
almost all of the people asking for a more precise specification work or
have worked in German speaking areas. Probably not by chance.)

|> Lack of common sense is providing complex solutions (or complex
|> infrastructure where you should put your complex solution) for
|> complex cases, ignoring providing reasonable simple solutions for
|> simple cases.

Would you be talking about locale, by any chance?

|> >(Removing trailing spaces is a different issue -- it is locale
|> >dependent, but other than that, I don't see any real problems.)

|> But where are the trim functions in the standard library ?

Where is any support for text? Where is a true character type?

Where is networking? Where is a GUI?

|> Anyway I don't think that anything I may say would convince you that
|> there's lack of common sense in what C++ proposes.

If "convince" implies my changing my opinion, no. Because I'm already
convinced of it for a number of things: all of locale, or the
templatization of iostream or string, for example.

Still, it's the only standard we've got, and we can (and have to) live
with it. It could be worse.

|> If you can't see why the following is ludicrous

|> if ( std::use_facet< std::collate< char > >( std::locale() )
|> .compare( s1.data(), s1.data() + s1.size(),
|> s2.data(), s2.data() + s2.size() ) == 0 ) ...

|> probably no amount of explanation would be enough.

What's wrong with a simple wrapper?
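
Such a wrapper might look like the sketch below. The name collateCompare is an assumption; the body is just the quoted expression with the locale made a defaulted parameter. Whether the result is case insensitive depends entirely on the locale passed in.

```cpp
#include <locale>
#include <string>

// Hypothetical wrapper around std::collate<char>::compare.
// Returns a negative value, zero, or a positive value, like strcmp.
int collateCompare( std::string const& s1,
                    std::string const& s2,
                    std::locale const& loc = std::locale() )
{
    return std::use_facet< std::collate< char > >( loc ).compare(
        s1.data(), s1.data() + s1.size(),
        s2.data(), s2.data() + s2.size() ) ;
}
```

The test then reads if ( collateCompare( s1, s2 ) == 0 ) ... at the call site.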

And to tell the truth: we're complaining about a lack of proper support
for text in C++. Did you, or any one else, make a proposal? I know I
didn't, and the committee can't standardize something that hasn't even
been proposed.

--
James Kanze


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

James Kanze

Aug 8, 2004, 6:22:05 PM
"Eugene Gershnik" <gers...@hotmail.com> writes:

|> ka...@gabi-soft.fr wrote:
|> > The C++ standard DOES have a function for case insensitive
|> > comparison of strings: std::collate::compare.

|> Let's compare it to other languages/dialects for a simple task:
|> condition based on an internet protocol name which is always English
|> and locale independent. I am perfectly aware that the snippets below
|> are not equivalent but this is beside the point which is how much
|> work should a simple and frequent task take.

Is it really that frequent to compare the name of a protocol in a URL?
More frequent than, say, looking up a person's name?

(I'm not saying that it shouldn't be easier. But I don't think that the
language should "prefer" this particular application, either.)

[...]
|> <standard C++>

|> const string protocol = "HTTP";

|> const char * const protocol_begin = protocol.c_str();
|> const char * const protocol_end = protocol_begin + protocol.length();
|> const char test_begin[] = "http";
|> const char * const test_end = test_begin + sizeof(test_begin) - 1;
|> if (use_facet<collate<char> >(locale::classic()).compare(
|> protocol_begin,
|> protocol_end,
|> test_begin,
|> test_end) == 0)
|> {
|> ...
|> }

|> </standard C++>

That's progress:-). Like:

std::cout << std::setprecision( 4 )
<< std::setw( 8 )
<< std::fixed
<< someDouble ;

instead of:

printf( "%8.4f", someDouble ) ;

:-).

Don't worry. One of these days, you'll be paid by the line, and you'll
appreciate it.

--
James Kanze


Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

Niklas Matthies

Aug 9, 2004, 5:25:30 AM
On 2004-08-08 22:04, James Kanze wrote:
:

> One important application I don't know is what Windows does with
> filenames, but MS-DOS also treated them as upper case, so I suspect
> that this is also the case. Thus, if you have a file named
> Maße.txt, and you try and create one named MASSE.TXT, the system
> should refuse (or replace the original file, depending on the
> context).

No, because the case-insensitivity of the filesystem needs to be
locale-independent. Otherwise two files with different names with
regard to locale X suddenly have the same name when seen from
locale Y. What Windows seems to do is a character-by-character
translation for those character codes that have a "canonical" 1:1
case mapping.

-- Niklas Matthies

Eugene Gershnik

Aug 9, 2004, 4:13:53 PM
James Kanze wrote:
> "Eugene Gershnik" <gers...@hotmail.com> writes:
>
>>> ka...@gabi-soft.fr wrote:
>>> > The C++ standard DOES have a function for case insensitive
>>> > comparison of strings: std::collate::compare.
>
>>> Let's compare it to other languages/dialects for a simple task:
>>> condition based on an internet protocol name which is always
>>> English and locale independent. I am perfectly aware that the
>>> snippets below are not equivalent but this is beside the point
>>> which is how much work should a simple and frequent task take.
>
> Is it really that frequent to compare the name of a protocol in a URL?
> More frequent than, say, looking up a person's name?
>
> (I'm not saying that it shouldn't be easier. But I don't think that
> the language should "prefer" this particular application, either.)

It is quite frequent in the area I work and besides network protocols are
not the only example. File formats, hardware protocols and other "inside the
computer" stuff are almost exclusively US English. I suspect that everybody
who writes software in these areas will find the arguments "ad locale" not
very convincing. I also realize that people who write other kinds of
software will have a different opinion.

Ideally I think a standard library should cater to the needs of both groups.
The current library is too small and uncompetitive compared with the stuff
other languages come with.

Software I have to write today is an order of magnitude more complex than it
used to be 10 years ago. My managers simply cannot afford to spend time
implementing functionality like to_upper, socket, thread etc. every time it
is required. I can either use 3rd party libraries for that or reuse company
specific libraries. The first approach is too costly or unreliable (from the
managers' point of view) and the second usually ends in disaster given the
fact that most in-house library designers are not exactly Andrei
Alexandrescu, or you or any of this forum's regulars. The end result is that
within the time, budget and other organizational constraints I have no
choice but to use one of the "popular languages".

Now, before the whole world jumps on me explaining why the arguments above
are BS let me say that I know that pretty well. I also tend to agree with
your often expressed opinion that "popular language #1" is unsuitable for
large scale software development. Despite all that I simply cannot sell C++
to my managers when items like "write/license library X for portable text
manipulation" are on any C++ project plan.

> That's progress:-). Like:
>
> std::cout << std::setprecision( 4 )
> << std::setw( 8 )
> << std::fixed
> << someDouble ;
>
> instead of:
>
> printf( "%8.4f", someDouble ) ;
>
> :-).

At the risk of wandering too far off-topic: even the next version of the Java
language will finally include a C compatible printf. I find it very encouraging
to see how an old but simple and elegant design still beats all new inventions.

> Don't worry. One of these days, you'll be paid by the line, and
> you'll appreciate it.

I'll just stop using templates then. A few copies of std::map for each type
will make an early retirement possible ;-)


--
Eugene

ka...@gabi-soft.fr

Aug 10, 2004, 2:42:46 PM
"Eugene Gershnik" <gers...@hotmail.com> wrote in message
news:<UaadnaC11Nb...@speakeasy.net>...

> James Kanze wrote:
> > "Eugene Gershnik" <gers...@hotmail.com> writes:

> >>> ka...@gabi-soft.fr wrote:
> >>> > The C++ standard DOES have a function for case insensitive
> >>> > comparison of strings: std::collate::compare.

> >>> Let's compare it to other languages/dialects for a simple task:
> >>> condition based on an internet protocol name which is always
> >>> English and locale independent. I am perfectly aware that the
> >>> snippets below are not equivalent but this is beside the point
> >>> which is how much work should a simple and frequent task take.

> > Is it really that frequent to compare the name of a protocol in a
> > URL? More frequent than, say, looking up a person's name?

> > (I'm not saying that it shouldn't be easier. But I don't think that
> > the language should "prefer" this particular application, either.)

> It is quite frequent in the area I work

And not very frequent in the areas I work in.

> and besides network protocols are not the only example. File formats,
> hardware protocols and other "inside the computer" stuff are almost
> exclusively US English.

Most of the network protocols today use case sensitive UTF-8. DNS is a
bit of an exception. All of the stuff "inside the computer" on the
machines I work on is also case sensitive, and more or less (human)
language independent. It's been a long time since I've had to deal with
7 bit ASCII, and for most of what I see, text is text, and the programs
just consider it a sequence of arbitrary bytes; there are typically a
couple of characters reserved for separating things, and that is it.

> I suspect that everybody who writes software in these areas will find
> the arguments "ad locale" not very convincing.

I don't know. I've done a lot of networking programming, and I find that
the rules are very variable. Generally speaking, case insensitivity only
works when you limit the character set, typically to seven bit ASCII. In
the more recent protocols I've had to deal with, everything was case
sensitive (and UTF-8) precisely to avoid these sort of problems.
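
For that restricted case (seven bit ASCII only, as for protocol names), a comparison needs no locale at all. The sketch below is one way to write it; the function names are hypothetical, and it assumes an ASCII-compatible execution character set in which 'A'..'Z' and 'a'..'z' are contiguous. Anything outside those ranges is compared byte for byte.

```cpp
#include <cstddef>
#include <string>

// Maps 'A'..'Z' to 'a'..'z' and leaves every other value untouched.
// Assumes an ASCII-compatible execution character set.
inline char asciiLower( char c )
{
    return ( c >= 'A' && c <= 'Z' )
        ? static_cast< char >( c - 'A' + 'a' )
        : c ;
}

// Hypothetical helper: equality ignoring case for ASCII letters only.
bool asciiEqualIgnoreCase( std::string const& a, std::string const& b )
{
    if ( a.size() != b.size() ) {
        return false ;
    }
    for ( std::size_t i = 0 ; i != a.size() ; ++ i ) {
        if ( asciiLower( a[ i ] ) != asciiLower( b[ i ] ) ) {
            return false ;
        }
    }
    return true ;
}
```

By construction this treats "HTTP" and "http" as equal, and "Maße" and "MASSE" as different, which is acceptable only because the input is known to be a protocol token and not human text.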

> I also realize that people who write other kinds of software will have
> a different opinion. Ideally I think a standard library should cater
> to the needs of both groups.

I think that most of the necessary framework is there, in locale. There
are some problems: the collate facet definitely needs additional
interface functions to handle standard strings, and in the ctype facet,
toupper and tolower really should work on strings, returning new values
(of not necessarily the same length). And arguably, the entire locale
interface should be redesigned to make it usable. But the idea that the
comparisons and conversions should be locale specific is a major step in
the right direction.

After that, the question is what locales should be supported?

> The current library is too small and uncompetitive compared with the
> stuff other languages come with.

True. Note that the Java function String.toUpperCase uses a locale, and
maps "ß" to "SS". Globally, I rather like the idea of a:
std::string::toupper( std::locale const& = std::locale() ) ;
function.

> Software I have to write today is an order of magnitude more complex
> than it used to be 10 years ago.

Totally agreed. Ten years ago, my string class had a (naïve) toUpper
function. Nobody demanded the complexity of locale dependent
conversions. Today, the applications I write generally do need it (if
they need toUpper at all -- most of the time, we use case sensitivity to
avoid the problem completely).

> My managers simply cannot afford to spend time implementing
> functionality like to_upper, socket, thread etc. every time it is
> required.

I agree that as it stands, C++ is unusable without a certain number of
additional third party libraries. And that getting these libraries to
work together is not always easy -- most of the ones we use were
initially written before the standard was adopted, and use their own
private string class, for example.

> I can either use 3rd party libraries for that or reuse company
> specific libraries. The first approach is too costly or unreliable
> (from the managers' point of view) and the second usually ends in
> disaster given the fact that most in-house library designers are not
> exactly Andrei Alexandrescu, or you or any of this forum's regulars.
> The end result is that within the time, budget and other
> organizational constraints I have no choice but to use one of the
> "popular languages".

I know the problem. And I agree that it is a problem. A real problem.

> Now, before the whole world jumps on me explaining why the arguments
> above are BS let me say that I know that pretty well. I also tend to
> agree with your often expressed opinion that "popular language #1" is
> unsuitable for large scale software development. Despite all that I
> simply cannot sell C++ to my managers when items like "write/license
> library X for portable text manipulation" are on any C++ project plan.

Been there. Done that. I know what you mean.

> > That's progress:-). Like:

> > std::cout << std::setprecision( 4 )
> > << std::setw( 8 )
> > << std::fixed
> > << someDouble ;

> > instead of:

> > printf( "%8.4f", someDouble ) ;

> > :-).

> At the risk of wandering too far off-topic: even the next version of
> the Java language will finally include a C compatible printf. I find
> it very encouraging to see how an old but simple and elegant design
> still beats all new inventions.

Actually, if you need formatted text, say in a table, nothing beats a
Cobol PIC clause, or most Basics' PRINT USING:-). For printf style
formatting in C++, however, see GB_Format, at my site
(www.gabi-soft.fr). Implemented for the reasons you discussed earlier: I
needed it, and it wasn't available elsewhere. (Actually, I only needed a
small subset. I went on and implemented 100% of printf formatting because
it was a challenge. Especially getting the '*' specifiers for length and
precision to work:-).)

> > Don't worry. One of these days, you'll be paid by the line, and
> > you'll appreciate it.

> I'll just stop using templates then. A few copies of std::map for each
> type will make an early retirement possible ;-)

Just preprocess, and save the results:-). Or write the program in
Cobol:-).

Still, one of the reasons I'm in demand, and have no real problem
finding a job, even in a depressed market, is that I do know how to
design and write all this stuff which is already part of most other
languages. So don't knock it:-). (Note, however, that even in more
complete languages, there are always things that are missing. My one
large Java project largely involved writing threading, networking and
GUI primitives. And finding the bugs, and their corresponding
work-arounds, in the standard library:-).)

--
James Kanze GABI Software http://www.gabi-soft.fr

Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

ka...@gabi-soft.fr

Aug 10, 2004, 2:52:39 PM
Niklas Matthies <usenet...@nmhq.net> wrote in message
news:<slrnchdc13.241...@nmhq.net>...

> On 2004-08-08 22:04, James Kanze wrote:

> > One important application I don't know is what Windows does with
> > filenames, but MS-DOS also treated them as upper case, so I suspect
> > that this is also the case. Thus, if you have a file named
> > Maße.txt, and you try and create one named MASSE.TXT, the system
> > should refuse (or replace the original file, depending on the
> > context).

> No, because the case-insensitivity of the filesystem needs to be
> locale-independent.

That's part of my point. There is no such thing as a locale independent
case-insensitivity. From a human point of view, all interactions with
the computer (commands, filenames, etc.) should be case insensitive.
From a practical point of view, this leads to the problem of locale
dependency. There is a trade-off which has to be made, and it isn't
always obvious.

> Otherwise two files with different names with regard to locale X
> suddenly have the same name when seen from locale Y.

Quite. And a lot depends on the use of the machine. On a "personal"
computer, there should be no problem using my "personal" locale; on a
shared computer, or a computer accessing a shared file system, this
becomes more problematic, and the system probably has to impose a locale
for itself, e.g.: locale "POSIX" or locale "WINDOWS".

As I've mentioned in another post, it might be worthwhile for the
standard to require such a locale, under a standardized name, e.g. "OS",
or "System", or some such.

> What Windows seems to do is a character-by-character translation for
> those character codes that have a "canonical" 1:1 case mapping.

Which defines a particular locale. It is NOT the "C" locale; in the "C"
locale, character comparisons (strcoll, std::collate<char>::compare,
etc.) are case sensitive.
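
This is easy to check. The sketch below (the function name is made up for illustration) compares two strings through the collate facet of std::locale::classic(), which is the "C" locale:

```cpp
#include <locale>
#include <string>

// True if the two strings collate as equal in the classic "C" locale.
bool equalInClassicLocale( std::string const& a, std::string const& b )
{
    std::collate< char > const& coll
        = std::use_facet< std::collate< char > >( std::locale::classic() ) ;
    return coll.compare( a.data(), a.data() + a.size(),
                         b.data(), b.data() + b.size() ) == 0 ;
}
```

Here equalInClassicLocale( "HTTP", "http" ) is false, while equalInClassicLocale( "http", "http" ) is true.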

--
James Kanze GABI Software http://www.gabi-soft.fr

Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Matt Austern

Aug 13, 2004, 11:04:56 PM
James Kanze <ka...@gabi-soft.fr> writes:

> And to tell the truth: we're complaining about a lack of proper support
> for text in C++. Did you, or any one else, make a proposal? I know I
> didn't, and the committee can't standardize something that hasn't even
> been proposed.

As LWG chair: I would love to see a proposal for better text handling
in C++, especially if it involved better treatment of i18n issues. I
think we have many of the primitive pieces you'd need to write that
proposal, but they aren't put together in a usable way.

One of the special problems with these sorts of issues, is that
careful handling of i18n, like careful handling of numerics, takes
domain expertise. There aren't many i18n experts on the C++ committee.

Andrea Griffini

Aug 14, 2004, 6:39:35 AM
On 13 Aug 2004 23:04:56 -0400, Matt Austern <aus...@well.com> wrote:

>One of the special problems with these sorts of issues, is that
>careful handling of i18n, like careful handling of numerics, takes
>domain expertise. There aren't many i18n experts on the C++ committee.

That is only half of the story... I think that correctness and
careful handling is just one of the dimensions of this problem.
Another IMO very important one is usability.

For example my impression is that the whole C++ I/O subsystem
dismissed usability and now we've joke-looking code snippets
just to write out a number with three decimal digits or to
get the integer value of a string.

I think that C++ made a few steps with respect to C on I/O,
but these are IMO steps in the wrong direction (i.e. LESS
dynamic code).

Reading the introduction of streams in TCPPPL I remember the
fear of what was about to come (starting by saying that
IO is difficult and no library will please everyone is like
starting a joke by telling that humour is a complex thing, and
not everyone will like the joke).

IMO my fear was justified.

And now I'm trembling in terror waiting for what will be the
result of more work on i18n.

How many people will work on the issue ? I've read somewhere
that the combined IQ of a committee can be easily computed
by starting from 100 and subtracting 5 for every participant :-)

Andrea

ka...@gabi-soft.fr

Aug 16, 2004, 3:54:07 PM
Andrea Griffini <agr...@tin.it> wrote in message
news:<d8lrh0t62crlaf8kb...@4ax.com>...

> On 13 Aug 2004 23:04:56 -0400, Matt Austern <aus...@well.com> wrote:

> >One of the special problems with these sorts of issues, is that
> >careful handling of i18n, like careful handling of numerics, takes
> >domain expertise. There aren't many i18n experts on the C++
> >committee.

> That is only half of the story... I think that correctness and
> careful handling is just one of the dimensions of this problem. Another
> IMO very important one is usability.

You wouldn't be thinking of <locale> now, would you?

I actually think that part of the problem is due to premature
standardization. For political reasons, the standard must have support
for internationalization. Even though we are at a state where not only
do we not know a good, general solution, we don't really even know how
to specify the problem. Thus, for example, it is obvious that word order
changes between languages -- the Open Systems have implemented support
for this in their versions of printf. What they have implemented has
always sufficed for the type of messages I print -- logs, error
messages, and the sort. But suppose you are generating messages that
should appear to come from a human being, always grammatically correct,
and that things like "n error(s) found" won't do the trick. The
classical solution in the Anglo-Saxon community is something like:

printf( "%d error%s found\n", errorCount, errorCount == 1 ? "" : "s" ) ;

Now, you'll need more than just getting a translated text string to fix
that in Italian. More generally, one might write something like:

"%d %s found\n", errorCount, errorCount == 1 ? "error" : "errors"

-- even in English, the original fails if we are counting feet, rather
than errors. Except that, of course, in many languages, "found" will
also change forms ("trovata"/"trovate", "trouvée"/"trouvées"...).

And of course, some languages have a dual, so you need a different form
if errorCount is 2 as well. And someone once told me that in Russian,
you use the singular behind numbers like twenty-one or thirty-one, which
end in "one".
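
One part of this can at least be sketched: letting the translation supply a rule that maps a count to the index of the word form to use. Everything below is an assumption for illustration. The function names are invented, the Russian rule is the three-form rule as it is commonly described (it is consistent with the twenty-one/thirty-one remark above), and the sketch says nothing about making "found" agree with the noun.

```cpp
#include <string>
#include <vector>

// English: form 0 for exactly one, form 1 for everything else.
unsigned englishForm( unsigned long n )
{
    return n == 1 ? 0 : 1 ;
}

// Russian, as commonly described:
//   form 0 for 1, 21, 31, ... (but not 11),
//   form 1 for 2-4, 22-24, ... (but not 12-14),
//   form 2 for everything else.
unsigned russianForm( unsigned long n )
{
    unsigned long const lastDigit = n % 10 ;
    unsigned long const lastTwo = n % 100 ;
    if ( lastDigit == 1 && lastTwo != 11 ) {
        return 0 ;
    }
    if ( lastDigit >= 2 && lastDigit <= 4
         && ( lastTwo < 12 || lastTwo > 14 ) ) {
        return 1 ;
    }
    return 2 ;
}

// Picks the word form for n from a table supplied with the translation.
// Throws std::out_of_range if the table has too few entries for the rule.
std::string selectForm( std::vector< std::string > const& forms,
                        unsigned ( *rule )( unsigned long ),
                        unsigned long n )
{
    return forms.at( rule( n ) ) ;
}
```

The English case then becomes a table { "error", "errors" } plus englishForm, and a language with a dual would simply supply a rule that returns a distinct index for 2.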

So the question is: what do we need to support this kind of thing?

And until we've defined the problem in a general way, I find it very
difficult to come up with a solution. I've implemented a number of
different solutions, in different applications, but each time, I
implemented a solution to the subset of the problem which our
application was concerned with (which has always permitted things like
"%d error(s) found\n").

> For example my impression is that the whole C++ I/O subsystem
> dismissed usability and now we've joke-looking code snippets just to
> write out a number with three decimal digits or to get the integer
> value of a string.

I don't know. The C++ I/O subsystem has several very important
improvements over that in C: much has been said about type safety (which
IMHO is very important) and extensibility, but let's not forget the
separation of the formatting from the sinking and sourcing of bytes as
well.

[...]

> How many people will work on the issue ? I've read somewhere that the
> combined IQ of a committee can be easily computed by starting from 100
> and subtracting 5 for every participant :-)

The IQ of a crowd is the lowest IQ of the people in the crowd, divided
by the number of people in the crowd. But I don't think that the
committee is really a crowd. And sometimes, one person, working alone,
can make a pretty big mess too.

--
James Kanze GABI Software http://www.gabi-soft.fr

Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung

9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

llewelly

Aug 19, 2004, 8:23:58 PM
Francis Glassborow <fra...@robinton.demon.co.uk> writes:

> In article <410987CF...@nospam.com>, Julie <ju...@nospam.com>
> writes
>>a - lower case 'A'
>>A - upper case 'A'
>>
>>case-insensitive comparison - a == A
>>
>>Remember, this operates on _characters_, not words.
>
> Fine, but how should accented lower case letter compare to unaccented
> uppercase ones? Please note that we use accents and other diacriticals
> in British English but generally only on lowercase letters.
[snip]

The solution is simple - if a character with more than one 'obvious'
or no obvious case conversion is encountered, the function calls
std::terminate() .

Are you appalled? I know I am. But without such a function, nearly
every program uses some in-house function which does some variant
of case-insensitive comparison, and, when faced with the
situations you describe, silently does the wrong thing.

Alf P. Steinbach

Aug 20, 2004, 6:06:38 AM
* llewelly:

> Francis Glassborow <fra...@robinton.demon.co.uk> writes:
>
> > In article <410987CF...@nospam.com>, Julie <ju...@nospam.com>
> > writes
> >>a - lower case 'A'
> >>A - upper case 'A'
> >>
> >>case-insensitive comparison - a == A
> >>
> >>Remember, this operates on _characters_, not words.
> >
> > Fine, but how should accented lower case letter compare to unaccented
> > uppercase ones? Please note that we use accents and other diacriticals
> > in British English but generally only on lowercase letters.
> [snip]
>
> The solution is simple - if a character with more than one 'obvious'
> or no obvious case conversion is encountered, the function calls
> std::terminate() .

I'm appalled.

The solution is simple, as I've described in another posting in this thread:
if the character code, regardless of locale issues and such, defines a
unique uppercase version of the lowercase accented letter, use that (e.g.
accented uppercase); if not, let the character be as-is -- a to_upper()
convenience function is for convenience, not for $50.000 word processing
with tens or hundreds of MiB natural language parser & KBS at bottom.

Incidentally I believe this approach, except perhaps the "ignore locale"
bit, reflects current practice, which is a Good Thing to standardize.


> Are you appalled?

See above.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?


Francis Glassborow

Aug 20, 2004, 10:41:09 AM
In article <4125491f...@news.individual.net>, Alf P. Steinbach
<al...@start.no> writes

>The solution is simple, as I've described in another posting in this thread:
>if the character code, regardless of locale issues and such, defines a
>unique uppercase version of the lowercase accented letter, use that (e.g.
>accented uppercase); if not, let the character be as-is -- a to_upper()
>convenience function is for convenience, not for $50.000 word processing
>with tens or hundreds of MiB natural language parser & KBS at bottom.

Many programmers assume that c==to_upper(to_lower(c)) and
c == to_lower(to_upper(c)) are universally true. It seems that this
assumption might be false.


--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

ka...@gabi-soft.fr

Aug 20, 2004, 10:43:21 AM

llewelly <llewe...@xmission.dot.com> wrote in message
news:<86r7q36...@Zorthluthik.local.bar>...
> Francis Glassborow <fra...@robinton.demon.co.uk> writes:

> > In article <410987CF...@nospam.com>, Julie <ju...@nospam.com>
> > writes
> >>a - lower case 'A'
> >>A - upper case 'A'

> >>case-insensitive comparison - a == A

> >>Remember, this operates on _characters_, not words.

> > Fine, but how should accented lower case letter compare to
> > unaccented uppercase ones? Please note that we use accents and other
> > diacriticals in British English but generally only on lowercase
> > letters.

> [snip]

> The solution is simple - if a character with more than one 'obvious'
> or no obvious case conversion is encountered, the function calls
> std::terminate() .

> Are you appalled? I know I am. But without such a function, nearly
> every program uses some in-house function which does some variant
> of case-insensitive comparison, and, when faced with the
> situations you describe, silently does the wrong thing.

Now that's an interesting point of view. I'm intrigued.

Basically, your argument is that practically every program uses some
broken version in house, so we should ensconce a specific broken version
in the standard. There is definitely some precedent (think of gets), and
at least in that case, we know where we stand.

I also like your suggestion for handling the awkward cases:-).
Seriously. There ARE contexts where case insensitivity makes sense, but
the only ones I can think of are when the character set is limited to
straight ASCII.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Kai-Uwe Bux

Aug 20, 2004, 11:24:55 PM

Francis Glassborow wrote:

> In article <4125491f...@news.individual.net>, Alf P. Steinbach
> <al...@start.no> writes
>>The solution is simple, as I've described in another posting in this
>>thread: if the character code, regardless of locale issues and such,
>>defines a unique uppercase version of the lowercase accented letter, use
>>that (e.g.
>>accented uppercase); if not, let the character be as-is -- a to_upper()
>>convenience function is for convenience, not for $50.000 word processing
>>with tens or hundreds of MiB natural language parser & KBS at bottom.
>
> Many programmers assume that c==to_upper(to_lower(c)) and
> c == to_lower(to_upper(c)) are universally true. It seems that this
> assumption might be false.
>
>

I would hope that no programmer assumes

c == to_upper( to_lower( c ) )

to be true for c == 'd'. Probably, you meant:

to_upper( c ) == to_upper( to_lower( c ) )

Now, that is something, that I think *should* be true for all values of c.
Do you know an instance, where it fails?


Best

Kai-Uwe Bux

Alf P. Steinbach

Aug 21, 2004, 12:15:34 AM

* ka...@gabi-soft.fr:

>
> There ARE contexts where case insensitivity makes sense, but
> the only ones I can think of are when the character set is limited to
> straight ASCII.

Most simple text searching operations can involve case insensitivity.

File names, process names, etc.

Usually my own case insensitive searches (as a computer user) require
at least 8259-1, since ASCII doesn't have the Norwegian æøåÆØÅ, or UCS2,
since many commonly used characters such as m-dash and euro are not in the
basic Latin-1 set.

It's no big deal to support this limited functionality, but the idea that
software simply shouldn't work if it cannot support all potential cases
is not that far-fetched -- because there's much actual software that
behaves that way!

For example, many of Microsoft's C++ development tools have traditionally
only worked 100% in Seattle/Redmond; in Visual Studio 7.1 (the latest
offering when disregarding beta of new version) the "front page", so to
speak, has three tabs called "Projects", "Online Resources" and "My
Profile", and "Online Resources" either works or does not work at all, that is,
no result whatsoever, not even gibberish, depending on the locale settings
of the machine and some mysterious factor that nobody's identified so far.
Presumably it doesn't call std::terminate and recover from that but instead
just throws an exception, when you don't have the right character code,
locale, keyboard and so on. What a great idea!

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Francis Glassborow

Aug 21, 2004, 12:07:46 PM

In article <cg55le$kth$1...@news01.cit.cornell.edu>, Kai-Uwe Bux
<jkher...@gmx.net> writes

>I would hope that no programmer assumes
>
> c == to_upper( to_lower( c ) )
>
>to be true for c == 'd'. Probably, you meant:
>
> to_upper( c ) == to_upper( to_lower( c ) )

Yes, that is what I should have written.

>
>Now, that is something, that I think *should* be true for all values of c.
>Do you know an instance, where it fails?

I think it easier to give the counter example for:

to_lower(c) == to_lower(to_upper(c));

and consider locales in which lower case accented letters are converted
to upper case un-accented ones.
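
The counterexample can be made concrete with a small sketch. It assumes Latin-1 code units and a made-up locale rule in which e-acute (0xE9) upper-cases to a plain 'E'; the function names are hypothetical and only the handful of mappings needed for the demonstration are implemented.

```cpp
// Hypothetical Latin-1 case mapping for a locale that drops accents when
// upper-casing. Only the mappings needed for the demonstration are present.
unsigned char demo_to_upper(unsigned char c)
{
    if (c == 0xE9) return 0x45;  // e-acute -> unaccented 'E' (assumed rule)
    if (c >= 0x61 && c <= 0x7A) return static_cast<unsigned char>(c - 0x20);
    return c;
}

unsigned char demo_to_lower(unsigned char c)
{
    if (c >= 0x41 && c <= 0x5A) return static_cast<unsigned char>(c + 0x20);
    return c;
}

// True when to_lower(c) == to_lower(to_upper(c)) holds for this code unit.
bool lower_round_trip_holds(unsigned char c)
{
    return demo_to_lower(c) == demo_to_lower(demo_to_upper(c));
}
```

For plain 'a' the identity holds. For 0xE9 the left side stays 0xE9, while the right side goes through 'E' and comes back as unaccented 'e', so the identity fails.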

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

Kai-Uwe Bux

Aug 21, 2004, 11:09:07 PM

Francis Glassborow wrote:

> In article <cg55le$kth$1...@news01.cit.cornell.edu>, Kai-Uwe Bux
> <jkher...@gmx.net> writes
>>I would hope that no programmer assumes
>>
>> c == to_upper( to_lower( c ) )
>>
>>to be true for c == 'd'. Probably, you meant:
>>
>> to_upper( c ) == to_upper( to_lower( c ) )
>
> Yes, that is what I should have written.
>
>>
>>Now, that is something, that I think *should* be true for all values of c.
>>Do you know an instance, where it fails?
>
> I think it easier to give the counter example for:
>
> to_lower(c) == to_lower(to_upper(c));
>
> and consider locales in which lower case accented letters are converted
> to upper case un-accented ones.
>

Thanks, that is convincing. I am forced to reconsider.


Best

Kai-Uwe Bux

Steven T. Hatton

Aug 22, 2004, 7:04:48 PM

Ganesh wrote:

> It is a surprise to most of the "common" C++ programmers to learn that
> std::string provides no simple way of doing case-insensitive
> comparison. Before posting this, I referred to:
>
> http://www.freshsources.com/bjarne/ALLISON.HTM
> http://www.josuttis.com/libbook/string/icstring.hpp.html
>
> Given that case-insensitive comparison is such a common operation,
> shouldn't it be made available within C++ standard library instead of
> leaving it to the programmers to re-write such commonly used
> functionality?
>
> -Ganesh

I haven't read all the many replies to this message, so I don't know if this
suggestion has been made yet. It seems to me the reasonable part of the
solution to provide in the Standard is a means for specifying a
comparison operator as a parameter to be provided by the programmer.
There is absolutely no way the C++ Standard could reasonably be expected to
specify what case insensitive means in every circumstance.

It seems reasonable to expect the C++ Library to support case-insensitive
comparison for the basic source character set defined in ISO/IEC 14882:2003
§2.2 ¶1. This is reasonable for at least three reasons. 1) It is the
native character set of the C++ Programming Language. 2) It is clearly
defined, and simple to implement. 3) It is the character set used by my
native natural language, so it solves all my immediate problems :D . Just
kidding, ;)
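
A rough sketch of both ideas, under stated assumptions: the comparison is passed in as a parameter, and the one predicate supplied here ignores case only for the 26 letters of the basic source character set. The names ascii_to_lower, AsciiNoCaseEqual and equal_strings are hypothetical, and the predicate assumes an ASCII-compatible execution character set.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: lower-case only the 26 letters A-Z. Assumes an
// ASCII-compatible execution character set, where A-Z are contiguous.
inline char ascii_to_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// A predicate supplied by the programmer; this one ignores case for A-Z only.
struct AsciiNoCaseEqual {
    bool operator()(char a, char b) const
    {
        return ascii_to_lower(a) == ascii_to_lower(b);
    }
};

// Compare two strings using a caller-provided character predicate.
template <typename CharEqual>
bool equal_strings(const std::string& a, const std::string& b, CharEqual eq)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), eq);
}
```

A caller who needs different rules passes a different predicate; the library part stays the same.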

There are some further arguments for supporting 'ASCII' (Note that it is not
actually ASCII that is specified in the Standard. ASCII is an encoding
specification. The Standard specifies a character set that maps
isomorphically to ASCII.) For example, many people who use computers in a
culture where ASCII has naught to do with the natural language - Devanagari
for example - have come up with (imperfect) means of encoding these
languages in ASCII for purposes of practical communication such as email.

http://www.ancientscripts.com/devanagari.html

I believe this is the most ambitious effort to create a generalized means of
programmatically processing natural languages:
http://oss.software.ibm.com/icu/

A wonderful example of people who use tools of this nature:
http://titus.uni-frankfurt.de/indexe.htm

The following is an obviously relevant source. I know there has been some
effort on the part of the Unicode Consortium to standardize
case-sensitivity.

http://www.unicode.org/

Something I hit on a google while looking for the ICU page:
http://publib.boulder.ibm.com/infocenter/comphelp/index.jsp?topic=/com.ibm.vacpp6a.doc/language/ref/lexical.unicode_standard.htm

--
Regards,
Steven

Niklas Matthies

Aug 22, 2004, 7:06:52 PM

On 2004-08-21 03:24, Kai-Uwe Bux wrote:
:
> to_upper( c ) == to_upper( to_lower( c ) )
>
> Now, that is something, that I think *should* be true for all values
> of c. Do you know an instance, where it fails?

It can fail when titlecase characters (such as U+01F2) are considered
to constitute uppercase and to_upper() returns them unchanged.
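
A small sketch of that failure, assuming a mapping that treats the title-case code point U+01F2 as already upper case. The function names are hypothetical and only the three related code points U+01F1 (DZ), U+01F2 (Dz) and U+01F3 (dz) are handled.

```cpp
// Hypothetical mapping for three related Unicode code points:
// U+01F1 (DZ, upper case), U+01F2 (Dz, title case), U+01F3 (dz, lower case).
// This sketch assumes the title-case form counts as "already upper case".
char32_t dz_upper(char32_t c)
{
    if (c == 0x01F3) return 0x01F1;  // dz -> DZ
    return c;                        // DZ and Dz are returned unchanged
}

char32_t dz_lower(char32_t c)
{
    if (c == 0x01F1 || c == 0x01F2) return 0x01F3;  // DZ, Dz -> dz
    return c;
}

// True when to_upper(c) == to_upper(to_lower(c)) holds for this code point.
bool upper_round_trip_holds(char32_t c)
{
    return dz_upper(c) == dz_upper(dz_lower(c));
}
```

With c = U+01F2 the left side is U+01F2, while the right side goes through U+01F3 and ends at U+01F1, so the two differ.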

-- Niklas Matthies

ka...@gabi-soft.fr

Aug 23, 2004, 6:29:33 PM

al...@start.no (Alf P. Steinbach) wrote in message
news:<412666be....@news.individual.net>...

> * ka...@gabi-soft.fr:

> > There ARE contexts where case insensitivity makes sense, but the
> > only ones I can think of are when the character set is limited to
> > straight ASCII.

> Most simple text searching operations can involve case insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").

Obviously, a simple, character based toupper or tolower function won't
help here.

> File names, process names, etc.

Most of the cases I've seen of these limit the character sets.
Extremely. The one exception is Windows, and I've not had the chance to
see what semantics they assign. Will the filename "Maße" match a file
created with the name of "MASSE"? What happens with 'i', whose upper
case equivalent depends on the language? Etc., etc. Or does Windows
just ignore the accents, so that "parlé" and "parle" are the same
filename? (I think that that is what I would do.)

Note that when all possible characters are allowed, even pure caps can
cause problems. For example, can you tell the difference between "AB"
and "\u0391\u0392" -- I don't know of any font where they are
distinguishable.

> Usually my own case insensitive searches (as a computer user) require
> at least 8259-1, since ASCII doesn't have the Norwegian æøåÆØÅ, or
> UCS2, since many commonly used characters such as m-dash and euro are
> not in the basic Latin-1 set.

I think that there is a typo somewhere in that paragraph, since Latin-1
is ISO 8859-1 (and I don't think that there is such a thing as ISO 8259,
although it seems frequently referenced). But note that you have already
introduced locale dependencies. In Norwegian (I think -- at least in
Danish and Swedish), letters like ø or æ ARE distinct letters, with
their own place in the alphabet. Not all languages treat accented
letters this way, and of course, as I've already mentioned, in Turkish,
'I' is NOT the upper case form of 'i' -- they are two distinct letters.
(Normally, Turkish would use Latin-3, where the capital of 'i' has the
code 0xA9. In Unicode, it would be \u0130.)
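
To illustrate the language dependency, here is a minimal sketch assuming Unicode code points and a hypothetical Language tag standing in for a real locale. Only a-z and the two Turkish i's are handled; the names Language and upper_for are made up for this example.

```cpp
// Hypothetical language tag; a real program would consult a locale instead.
enum class Language { Other, Turkish };

// Upper-case one code point, handling only a-z and the two Turkish i's.
char32_t upper_for(char32_t c, Language lang)
{
    // Small dotted i: capital with dot (U+0130) in Turkish, plain 'I' elsewhere.
    if (c == U'i') return lang == Language::Turkish ? char32_t(0x0130) : U'I';
    // Small dotless i (U+0131) always upper-cases to plain 'I'.
    if (c == 0x0131) return U'I';
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - 0x20);
    return c;
}
```

The same input character gives two different answers depending on the language, which is why a locale-free to_upper cannot be right for both.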

> It's no big deal to support this limited functionality, but the idea
> that software simply shouldn't work if it cannot support all potential
> cases is not that far-fetched -- because there's much actual software
> that behaves that way!

How true:-).

In fact, I'm not so much against the idea of standardizing something, as
I am against standardizing it now, when we don't yet know what the
correct solution is (assuming there is one).

> For example, many of Microsoft's C++ development tools have
> traditionally only worked 100% in Seattle/Redmond;

And those from Sun in California:-). And so on. So we standardize bad
practices, on the grounds that they are wide-spread?

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Alf P. Steinbach

Aug 24, 2004, 2:38:48 AM

* ka...@gabi-soft.fr:

> al...@start.no (Alf P. Steinbach) wrote in message
> news:<412666be....@news.individual.net>...
>
> > * ka...@gabi-soft.fr:
>
> > > There ARE contexts where case insensitivity makes sense, but the
> > > only ones I can think of are when the character set is limited to
> > > straight ASCII.
>
> > Most simple text searching operations can involve case insensitivity.
>
> Most simple text searching operations should involve case
> insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
> "MASZE").
>
> Obviously, a simple, character based toupper or tolower function won't
> help here.

Obviously it does help, because that's what I & a zillion computer users use
today and find very helpful... ;-)

When it's _simple_ enough that the user can understand it fully, the user
can supply the intelligence that it seems you'd like it to have, and as of
2004 any intelligent design places the req. of intelligence on the user.

When it's complex & intelligent enough to handle most such cases it won't be
simple enough to understand (so the user cannot then know and work around
limitations), furthermore it probably won't be there at all...

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Julie

Aug 24, 2004, 2:39:53 AM

ka...@gabi-soft.fr wrote:
>
> al...@start.no (Alf P. Steinbach) wrote in message
> news:<412666be....@news.individual.net>...
>
> > * ka...@gabi-soft.fr:
>
> > > There ARE contexts where case insensitivity makes sense, but the
> > > only ones I can think of are when the character set is limited to
> > > straight ASCII.
>
> > Most simple text searching operations can involve case insensitivity.
>
> Most simple text searching operations should involve case
> insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
> "MASZE").

Sorry, but I'm going to have to interject here --

Let's back up for a sec -- the things that we are dealing with are called
'strings' right -- and that is just a shortened term for an 'array of
characters'. We aren't dealing with some type called 'word', so what happens
at the word level has absolutely no relevance.

If a character in a character set has an upper case and lower case equivalent,
then that is used in case transformation. If it doesn't, then it isn't
transformed. Plain, simple, and quite easy to understand.

Forget this MASSE word. Speak strictly on the ß character:

What is the upper case _character_ of ß?

What is the lower case _character_ of ß?

If the answer is that there isn't an upper or lower case, then:

ß == toupper('ß') && ß == tolower('ß')

*regardless* of adjacent characters (that may be interpreted as a 'word').

Anthony Williams

Aug 24, 2004, 3:47:30 PM

Julie <ju...@nospam.com> writes:

> ka...@gabi-soft.fr wrote:
> >
> > al...@start.no (Alf P. Steinbach) wrote in message
> > news:<412666be....@news.individual.net>...
> >
> > > * ka...@gabi-soft.fr:
> >
> > > > There ARE contexts where case insensitivity makes sense, but the
> > > > only ones I can think of are when the character set is limited to
> > > > straight ASCII.
> >
> > > Most simple text searching operations can involve case insensitivity.
> >
> > Most simple text searching operations should involve case
> > insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
> > "MASZE").
>
> Sorry, but I'm going to have to interject here --
>
> Let's back up for a sec -- the things that we are dealing with are called
> 'strings' right -- and that is just a shortened term for an 'array of
> characters'. We aren't dealing with some type called 'word', so what happens
> at the word level has absolutely no relevance.
>
> If a character in a character set has an upper case and lower case
> equivalent, then that is used in case transformation. If it doesn't, then
> it isn't transformed. Plain, simple, and quite easy to understand.

There are cases where the upper or lower case equivalents are not a single
character. There are also cases where the transformation is not reversible,
since there are multiple lower case characters with the same upper case
character.

There are also cases where the upper or lower case equivalent depends on the
current locale, and/or the context of the rest of the word/sentence ---
e.g. lower case sigma is different at the end of a word than in the middle, and
the upper case character for 'i' depends on the language.

If you mean to disregard all these cases, then you *can* define a simplified
toupper/tolower, where characters without a *simple* translation are left
as-is. Whether this is then useful is another question.

> Forget this MASSE word. Speak strictly on the ß character:
>
> What is the upper case _character_ of ß?

_two_ characters --- SS

> What is the lower case _character_ of ß?

ß is lower case.

Anthony
--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.

James Hopkin

Aug 24, 2004, 6:35:02 PM

Julie <ju...@nospam.com> wrote in message news:<412A96BF...@nospam.com>...

>
> What is the upper case _character_ of ß?
>
> What is the lower case _character_ of ß?
>

The answers to your questions are upper case: SS, lower case: ß

At least, I believe that's most common, if not obligatory.

Similarly, ö is often capitalised as OE.

That being the case in German, I can well believe there are similar
situations in other alphabetic languages, where there isn't a
one-to-one mapping between lower- and upper-case.


James

ka...@gabi-soft.fr

Aug 24, 2004, 6:41:32 PM

Julie <ju...@nospam.com> wrote in message
news:<412A96BF...@nospam.com>...
> ka...@gabi-soft.fr wrote:

> > al...@start.no (Alf P. Steinbach) wrote in message
> > news:<412666be....@news.individual.net>...

> > > * ka...@gabi-soft.fr:

> > > > There ARE contexts where case insensitivity makes sense, but
> > > > the only ones I can think of are when the character set is
> > > > limited to straight ASCII.

> > > Most simple text searching operations can involve case
> > > insensitivity.

> > Most simple text searching operations should involve case
> > insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
> > "MASZE").

> Sorry, but I'm going to have to interject here --

> Let's back up for a sec -- the things that we are dealing with are
> called 'strings' right -- and that is just a shortened term for an
> 'array of characters'.

Actually, in C++, it's just a standard term for an array of small
integers. C++ doesn't have a character type.

> We aren't dealing with some type called 'word', so what happens at
> the word level has absolutely no relevance.

The question is: are we or are we not dealing with text? If we are
dealing with text, then we treat it as text. If we aren't, then what do
upper and lower case mean?

> If a character in a character set has an upper case and lower case
> equivalent, then that is used in case transformation. If it doesn't,
> then it isn't transformed. Plain, simple, and quite easy to understand.

For a programmer. For a user who does a search on "Maße", and doesn't
find "MASSE", it's impossible to understand.

> Forget this MASSE word. Speak strictly on the ß character:

> What is the upper case _character_ of ß?

"SS". Or "SZ", in some contexts, but I suspect that you could get away
with "SS".

> What is the lower case _character_ of ß?

"ß"

> If the answer is that there isn't an upper or lower case,

If the answer is that the alphabet in question doesn't have case, then
there is no problem. The problem is that 'ß' is lower case, and that
its upper case equivalent requires two characters.

(Actually, it's more subtle than that. At least one character set has a
character 'SS', a single character that when typeset looks exactly like
two S's. In that character set, toupper( 'ß' ) works.)

> then:

> ß == toupper('ß') && ß == tolower('ß')

> *regardless* of adjacent characters (that may be interpreted as a
> 'word').

It has nothing to do with words. It has to do with the fact that the
upper case variant of one letter might require more than one letter.
There's also the fact that the upper case equivalent is definitely locale
specific -- different locales have different rules.
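
One way to picture this is to upper-case whole strings rather than single characters, so that one input letter may produce two output letters. The following is only a sketch under stated assumptions: text is held as UTF-32 code points, only a-z and sharp s (U+00DF) are handled, locale rules are ignored, and the names upper_full and equal_nocase_full are hypothetical.

```cpp
#include <string>

// Hypothetical string-level upper-casing: the result may be longer than the
// input because sharp s (U+00DF) expands to the two letters "SS".
std::u32string upper_full(const std::u32string& s)
{
    std::u32string out;
    for (char32_t c : s) {
        if (c == 0x00DF) {
            out += U"SS";                              // one letter becomes two
        } else if (c >= U'a' && c <= U'z') {
            out += static_cast<char32_t>(c - 0x20);    // a-z -> A-Z
        } else {
            out += c;                                  // everything else as-is
        }
    }
    return out;
}

// Case-insensitive equality built on the string-level mapping.
bool equal_nocase_full(const std::u32string& a, const std::u32string& b)
{
    return upper_full(a) == upper_full(b);
}
```

With this mapping "Maße" and "MASSE" compare equal, which a character-by-character toupper cannot achieve because the two strings have different lengths.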

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
