
how do I lowercase a string??


Gary Kushner
Mar 14, 1997

This is how I lowercase a string:

void lower(char& c)
{
    c = tolower(c);
}

for_each(s.begin(), s.end(), lower);

---

There must be a more direct way. What am I missing?

Also, does anyone know of any new books on the standard library?

tia,
-Gary


[ Send an empty e-mail to c++-...@netlab.cs.rpi.edu for info ]
[ about comp.lang.c++.moderated. First time posters: do this! ]


Gary Kushner
Mar 14, 1997

OK, I found:

transform(s.begin(), s.end(), tolower);

but it still seems like maybe there should be a more direct approach.

Fergus Henderson
Mar 15, 1997

kus...@i.b.m.net (Gary Kushner) wrote:

>transform(s.begin(), s.end(), tolower);

That should be

transform(s.begin(), s.end(), s.begin(), tolower);
^^^^^^^^^^

Furthermore, it is buggy; it may fail on systems where `char' is signed.
You need to cast the argument to tolower() to `unsigned char'.

>>void lower(char& c)
>>{
>> c = tolower(c);
>>}
>>
>>for_each(s.begin(), s.end(), lower);

This has the same problem.

I don't know of any more direct approach than

unsigned char my_tolower(unsigned char c) { return tolower(c); }
...
transform(s.begin(), s.end(), s.begin(), my_tolower);

or

void make_lower(char & c) { c = tolower((unsigned char) c); }
...
for_each(s.begin(), s.end(), make_lower);

I'm not entirely sure that the latter is strictly conforming, so
I would use the former.
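Spelled out as a complete program, the first form would look something
like this (just a sketch; I'm assuming the draft's <cctype> and <string>
headers are available):

    #include <algorithm>
    #include <cctype>
    #include <string>

    // Forcing the argument through unsigned char keeps tolower() inside
    // its defined domain even when plain char is signed.
    unsigned char my_tolower(unsigned char c) { return std::tolower(c); }

    int main()
    {
        std::string s = "Hello, World!";
        std::transform(s.begin(), s.end(), s.begin(), my_tolower);
        // s == "hello, world!"
        return 0;
    }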

--
Fergus Henderson <f...@cs.mu.oz.au> | "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>  |  of excellence is a lethal habit"
PGP: finger f...@128.250.37.3       |  -- the last words of T. S. Garp.

Dietmar Kuehl
Mar 16, 1997

Hi,
Gary Kushner (kus...@i.b.m.net) wrote:
: transform(s.begin(), s.end(), tolower);

It has already been pointed out that this should actually read

transform(s.begin(), s.end(), s.begin(), tolower);

Not being an American, I don't find this solution satisfactory: for
example, the second letter of my name "Kuehl", where the "ue" is
actually just one letter, namely a u-umlaut or "udieresis" (the
PostScript name), is handled incorrectly. (This letter is handled
incorrectly so often that I have replaced it with the sequence 'ue',
the common transcription...) It is not considered to be a letter and
is thus not subject to conversion by 'toupper()' or 'tolower()'. A
more reasonable approach is to use the 'ctype' facet of locales to
make the conversion:

#include <locale>
#include <string>
#include <algorithm>
using namespace std;

template <class charT>
struct lower
{
    lower(): facet(use_facet<ctype<charT> >(locale())) {}
    charT operator()(charT c) const { return facet.tolower(c); }
private:
    ctype<charT> const &facet;
};

template <class charT>
void tolower(basic_string<charT> &str)
{
    transform(str.begin(), str.end(), str.begin(), lower<charT>());
}
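Continuing the snippet above, usage would look like this (a sketch;
whether the environment supplies a named locale with a proper ctype
facet is an assumption):

    int main()
    {
        locale::global(locale(""));  // adopt the user's native locale
        string s = "Kuehl";
        tolower(s);                  // uses the ctype facet of the global locale
        // s == "kuehl"; under an ISO 8859-1 locale a u-umlaut is lowered too
        return 0;
    }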

: but it still seems like maybe there should be a more direct approach.

Well, my approach is even more indirect but IMO more appropriate...
Looking at the locales library, you will see that in some places
C-strings are preferred over string classes and iterator abstractions.
For example, you can do the following:

void tolower(char *begin, char const *end)
{
    use_facet<ctype<char> >(locale()).tolower(begin, end);
}

You cannot do the same for other iterators! The problem here is that
it would be desirable to have a member template function 'tolower()'
in the class 'ctype'. Looking at the current implementation of
'tolower()', it becomes apparent that a member template function is
impossible: 'tolower()' just dispatches the request to 'do_tolower()',
which is a virtual function. If 'tolower()' were a member template
function, 'do_tolower()' would also have to be a member template
function in addition to being virtual. But a member template function
is not allowed to be virtual...
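To illustrate the constraint (a schematic of the dispatch, not the
actual library text):

    template <class charT>
    class ctype_like
    {
    public:
        // public non-virtual interface: dispatches to the virtual function
        charT tolower(charT c) const { return do_tolower(c); }

        // What one would want for arbitrary iterators -- and cannot have:
        //
        //     template <class FwdIterator>              // a member template...
        //     virtual FwdIterator do_tolower(FwdIterator begin,
        //                                    FwdIterator end) const;
        //                                               // ...may not be virtual
    protected:
        virtual charT do_tolower(charT c) const { return c; }
    };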

The only chance to remove this problem would be to add another template
argument to facets like 'ctype' (basically the same problem occurs with
other facets, too):

template <class charT, class FwdIterator = charT*>
class ctype
{
public:
    //...
    FwdIterator tolower(FwdIterator begin, FwdIterator end) const;
};

With a definition like this, you would be able to convert a string to
all lowercase with this code:

template <class charT>
void tolower(basic_string<charT> &str)
{
    typedef ctype<charT, typename basic_string<charT>::iterator> Facet;
    use_facet<Facet>(locale()).tolower(str.begin(), str.end());
}

yielding satisfactory results for everybody, including non-Americans
using some extended character set... I would expect that this works
with most implementations anyway, since 'basic_string<charT>::iterator'
is normally defined as 'charT*'. However, this is not guaranteed.

Although this no longer has anything to do with the original problem, I
just want to point out that the above approach can be made to work even
with those members of 'ctype' which currently return a 'const charT*'.
If a typedef 'const_iterator' is added to 'iterator_traits<T>', defined
to be 'T' if 'T' is already the constant iterator (or if there is no
other iterator) and to the constant iterator otherwise, then e.g. the
'scan_is()' member could become this:

template <class charT, class FwdIterator = charT*>
class ctype
{
    typedef typename iterator_traits<FwdIterator>::const_iterator
        const_iterator;
public:
    const_iterator
    scan_is(mask m, const_iterator beg, const_iterator end) const;
    //...
};

Sure, this is a little bit more complex, but it would no longer favour
C-strings over C++ strings.
--
<mailto:dietma...@uni-konstanz.de>
<http://www.informatik.uni-konstanz.de/~kuehl/>
I am a realistic optimist - that's why I appear to be slightly pessimistic

James Kanze
Mar 17, 1997

ku...@uzwil.informatik.uni-konstanz.de (Dietmar Kuehl) writes:

|> Gary Kushner (kus...@i.b.m.net) wrote:
|> : transform(s.begin(), s.end(), tolower);
|>
|> It has already been pointed out that this should actually read
|>
|> transform(s.begin(), s.end(), s.begin(), tolower);
|>
|> Not being an American, I don't find this solution satisfactory: for
|> example, the second letter of my name "Kuehl", where the "ue" is
|> actually just one letter, namely a u-umlaut or "udieresis" (the
|> PostScript name), is handled incorrectly. It is not considered to be
|> a letter and is thus not subject to conversion by 'toupper()' or
|> 'tolower()'.

Have you called setlocale first? This should work, and does, at least
on the Sun implementation (both Sun CC and g++). (This is the only
implementation I've tried lately; my HP doesn't have ISO 8859-1 support
installed.)
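Something along these lines (a sketch; the locale name is
system-dependent, "de_DE.ISO8859-1" is only an example):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main()
    {
        setlocale(LC_CTYPE, "de_DE.ISO8859-1");  // name varies by system
        unsigned char c = 0xDC;                  // capital u-umlaut in ISO 8859-1
        printf("%#x\n", tolower(c));             // 0xfc once the locale is in effect
        return 0;
    }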

[... rest deleted]

--
James Kanze home: ka...@gabi-soft.fr +33 (0)1 39 55 85 62
office: ka...@vx.cit.alcatel.fr +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
-- Conseils en informatique industrielle --

Gary Kushner
Mar 17, 1997

>That should be
>
> transform(s.begin(), s.end(), s.begin(), tolower);

true enough, that was just a typo...

>Furthermore, it is buggy; it may fail on systems where `char' is signed.
>You need to cast the argument to tolower() to `unsigned char'.

Why is it buggy? As long as it is given ASCII it should work, and if
it isn't, isn't tolower always undefined anyway?

>I don't know of any more direct approach than

I sure haven't found anything--still think it's strange though. I
sure wouldn't mind a good book on the SL.

thanks for the reply.

-Gary

Gary Kushner
Mar 17, 1997

I posted a response (mostly a thanks for the help), but it doesn't
seem to have shown up? I know ibm is having lots of email/news
problems so maybe it went to the bitbucket... -Gary

On 15 Mar 1997 17:55:31 -0500, f...@mundook.cs.mu.OZ.AU (Fergus
Henderson) wrote:

>kus...@i.b.m.net (Gary Kushner) wrote
>
>>transform(s.begin(), s.end(), tolower);
>
>That should be
>
> transform(s.begin(), s.end(), s.begin(), tolower);
>
>[rest of message snipped]
{Sig and banner trimmed... please don't overquote. -mod}

Gary Kushner
Mar 18, 1997

>It has already been pointed out that this should actually read
>
> transform(s.begin(), s.end(), s.begin(), tolower);
>

Right, just a typo on my part. And thank you for the comments; they
were very helpful, but I think they do point out a weakness in the SL.
Should every C++ programmer who wants to lowercase a string be
required to know the details of the STL? Sure, the STL is a good set
of general-purpose tools (with a reference, but not much in the way of
user guides), but just because the STL can accomplish something, does
that mean no other method should exist? I think it makes C++ (at
least the SL) a very intimidating language to learn.

-Gary

Fergus Henderson
Mar 18, 1997

kus...@i.b.m.net (Gary Kushner) writes:

>>That should be
>>
>> transform(s.begin(), s.end(), s.begin(), tolower);
>

>true enough, that was just a typo...
>

>>Furthermore, it is buggy; it may fail on systems where `char' is signed.
>>You need to cast the argument to tolower() to `unsigned char'.
>

>Why is it buggy?

Because if the argument to tolower() is negative (other than EOF),
then the behaviour is undefined.

>As long as it is given ASCII it should work, and if
>it isn't, isn't tolower always undefined anyway?

ASCII has nothing to do with it -- neither the C standard nor
the C++ draft require ASCII, and there are implementations that
use EBCDIC, for example.

The behaviour of tolower() is defined if the argument's value is
the value of an unsigned char, or EOF. But if the argument is
negative, the behaviour is undefined. A common implementation
technique for tolower() is to use something along the lines of

#define tolower(c) (__chartab[(c) + 1] & __LOWERCASE)

where __chartab is an array of length UCHAR_MAX + 2. Clearly this will
produce bogus results when the value passed is negative. For example,
with 8-bit signed chars and ISO 8859-1, u-umlaut (0xFC) arrives as -4,
and the macro indexes __chartab[-3], outside the array.

--
Fergus Henderson <f...@cs.mu.oz.au> | "I have always known that the pursuit
WWW: <http://www.cs.mu.oz.au/~fjh>  |  of excellence is a lethal habit"
PGP: finger f...@128.250.37.3       |  -- the last words of T. S. Garp.


James Kanze
Mar 19, 1997

kus...@i.b.m.net (Gary Kushner) writes:

|> >That should be
|> >
|> > transform(s.begin(), s.end(), s.begin(), tolower);
|>
|> true enough, that was just a typo...
|>
|> >Furthermore, it is buggy; it may fail on systems where `char' is signed.
|> >You need to cast the argument to tolower() to `unsigned char'.
|>

|> Why is it buggy? As long as it is given ascii, and if not isn't
|> tolower always undefined?

The problem is that it won't always be given ASCII; if the source is of
type "char", you can't guarantee it, and in Dietmar's (and my) case,
non-ASCII characters are rather frequent in typical input.

This has been hashed out recently on comp.std.c++:

1. tolower takes an int as argument; it is only defined, however, for
values in the range 0...UCHAR_MAX and EOF.

2. Whether plain char is signed or not is implementation-defined. You
cannot count on it being unsigned in portable code, and in fact, the
signed variant seems to predominate.

3. If plain char is signed, some of its values will be negative. When
passed to tolower, they will be converted to int, with NO loss of value;
i.e. they will still be negative.

The most typical situation today would seem to be 8-bit signed chars
with ISO 8859-1 character codes. This means that some of the actual
character codes that will be read *will* have negative values, even if
you are reading straight text.

4. Conclusion: you cannot safely pass a char to tolower, since some of
the possible values will result in undefined behavior. You must first
cast it to unsigned char.
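A tiny demonstration of the idiom (a sketch; it assumes the environment
supplies an ISO 8859-1 locale):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main()
    {
        setlocale(LC_CTYPE, "");            // adopt the environment's locale
        char c = '\xDC';                    // capital u-umlaut; negative if char is signed
        // tolower(c) would pass a negative value: undefined behavior.
        int lc = tolower((unsigned char)c); // passes 0xDC: well defined
        printf("%#x -> %#x\n", (unsigned char)c, lc);  // 0xdc -> 0xfc here
        return 0;
    }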

--
James Kanze home: ka...@gabi-soft.fr +33 (0)1 39 55 85 62
office: ka...@vx.cit.alcatel.fr +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
-- Conseils en informatique industrielle --


Gary Kushner
Mar 20, 1997

>where __chartab is an array of length UCHAR_MAX + 2. Clearly
>this will produce bogus results when the value passed is negative.

True enough. Thanks for the info. When I write code to be portable
I'm going to have to be more careful than I have been.

-Gary

Gary Kushner
Mar 20, 1997

On 19 Mar 1997 02:25:16 -0500, James Kanze
<james-alb...@vx.cit.alcatel.fr> wrote:

>4. Conclusion: you cannot safely pass a char to tolower, since some of
>the possible values will result in undefined behavior. You must first
>cast it to unsigned char.

Well, thanks for the info. The current project is just for in-house,
so I'm not going to worry about it now, but it is good to know.

Tony Cook
Mar 21, 1997

[second attempt at a similar article - the first seems to have been
lost somewhere]

Dietmar Kuehl (ku...@uzwil.informatik.uni-konstanz.de) wrote:
: Not being an American, ...

...

: template <class charT, class FwdIterator = charT*>
: class ctype
: {
: public:
:     //...
:     FwdIterator tolower(FwdIterator begin, FwdIterator end) const;
: };

: With a definition like this, you would be able to convert a string to
: all lowercase with this code:

: template <class charT>
: void tolower(basic_string<charT> &str)
: {
:     typedef ctype<charT, typename basic_string<charT>::iterator> Facet;
:     use_facet<Facet>(locale()).tolower(str.begin(), str.end());
: }

: yielding satisfactory results for everybody, including non-Americans
: using some extended character set... I would expect that this works
: with most implementations anyway since 'basic_string<charT>::iterator'
: is normally defined as 'charT*'. However, this is not guaranteed.

Will this also work for MBCS?

If basic_string<charT>::iterator is defined as charT* then I can't
see how it would, as it may attempt to lower-case a trail byte in a
multi-byte character.

If basic_string<charT>::iterator is meant to support MBCS then I
can't see how it can be a bidirectional iterator, which would be
required since basic_string<charT>::reverse_iterator would depend on
this.

--
Tony Cook - to...@online.tmx.com.au
10023...@compuserve.com

Bradd W. Szonye
Mar 21, 1997

Tony Cook <to...@online.tmx.com.au> wrote in article
<5gtnjk$6...@netlab.cs.rpi.edu>...

>
> Will this also work for MBCS?
>
> If basic_string<charT>::iterator is defined as charT* then I can't
> see how it would, as it may attempt to lower-case a trail byte in a
> multi-byte character.
>
> If basic_string<charT>::iterator is meant to support MBCS then I
> can't see how it can be a bidirectional iterator, which would be
> required since basic_string<charT>::reverse_iterator would depend on
> this.

Transformations on NTCS's will generally work only if they are not also
NTMBS's. Most of the transformations are defined to work on single
characters only. For the desired effect, you should first transform any
NTMBS into an equivalent NTCS where the character size is sufficient to
represent any multibyte-character with a single (larger) character. Usually
this will involve using the codecvt facet to convert, say, a
basic_string<char> to a basic_string<wchar_t> or basic_string<int>.

Note that the standard allows you to write narrow string literals which are
NTMBS's, but wide string literals never need an analogous multi-unit
encoding--wchar_t must be capable of representing any character in the
extended execution character set, so a wide string never has to be
multibyte.

There are some tricky points in dealing with extended characters properly;
for example, filebufs always "see" the external character sequences as
multibyte chars. The best way to fit into the C++ character paradigm--if
you need to worry about multibyte characters--is to

- treat files as multibyte char sequences externally
- treat strings (wstrings actually) as single-character wchar_t sequences
internally

That works fairly well with, for example, the Windows NT view of the world,
where characters are MBS externally (to save space) but WCS internally (for
efficient processing).

I can imagine one very reasonable implementation of all this character
stuff:

- the basic character set is the first 128 characters of UCS-2 (the subset
commonly known as US-ASCII)
- the multibyte extended encoding is UTF-8
- the extended character set is UCS-2

Then, basic_ios<> uses UTF-8 for storing external character sequences;
wchar_t is 16 bits and unsigned; char is 8 bits, with sign unspecified
(either signed for compatibility or unsigned for consistency with wchar_t);
basic_string<char> is a UTF-8 encoding of a string; basic_string<wchar_t>
is a UCS-2 encoding of a string; and basic_string<int> is a UCS-4 encoding
of a string (for those rare instances when you need to convert to/from an
external UCS-4 sequence).

One complaint I can see with this implementation is that UTF-8 is really
not compatible with 8859-1 (Latin 1) and its friends. It's a fairly
heavyweight solution for users who normally use the upper half of 8859-1;
the tradeoff is consistency and extensibility for ease of specifying
\u0080-\u00FF (which are two-byte characters under UTF-8).
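To make the two-byte point concrete, the encoding rule itself is
mechanical (a sketch; to_utf8 is a hypothetical helper covering only
the ranges UCS-2 needs):

    #include <string>

    // Encode one UCS-2 code point as UTF-8 (hypothetical helper; no error handling).
    std::string to_utf8(unsigned short c)
    {
        std::string out;
        if (c < 0x80) {                        // U+0000..U+007F: one byte (US-ASCII)
            out += (char)c;
        } else if (c < 0x800) {                // U+0080..U+07FF: two bytes
            out += (char)(0xC0 | (c >> 6));
            out += (char)(0x80 | (c & 0x3F));
        } else {                               // U+0800..U+FFFF: three bytes
            out += (char)(0xE0 | (c >> 12));
            out += (char)(0x80 | ((c >> 6) & 0x3F));
            out += (char)(0x80 | (c & 0x3F));
        }
        return out;
    }
    // to_utf8(0x00FC) yields "\xC3\xBC" -- two bytes for u-umlaut, as noted above.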
--
Bradd W. Szonye
bra...@concentric.net

James Kanze
Mar 24, 1997

"Bradd W. Szonye" <bra...@concentric.net> writes:

|> There are some tricky points in dealing with extended characters
properly;
|> for example, filebufs always "see" the external character sequences
as
|> multibyte chars.

Why? There are two ways to handle this: use different traits classes
for different external character codes, or change behavior according to
the imbued locale.

|> The best way to fit into the C++ character paradigm--if
|> you need to worry about multibyte characters--is to
|>
|> - treat files as multibyte char sequences externally
|> - treat strings (wstrings actually) as single-character wchar_t
sequences
|> internally
|>
|> That works fairly well with, for example, the Windows NT view of the
world,
|> where characters are MBS externally (to save space) but WCS
internally (for
|> efficient processing).
|>
|> I can imagine one very reasonable implementation of all this
character
|> stuff:
|>
|> - the basic character set is the first 128 characters of UCS-2 (the
subset
|> commonly known as US-ASCII)
|> - the multibyte extended encoding is UTF-8
|> - the extended character set is UCS-2

This sounds close to what is done in Plan 9.

|> Then, basic_ios<> uses UTF-8 for storing external character
sequences;
|> wchar_t is 16 bits and unsigned; char is 8 bits, with sign
unspecified
|> (either signed for compatibility or unsigned for consistency with
wchar_t);
|> basic_string<char> is a UTF-8 encoding of a string;
basic_string<wchar_t>
|> is a UCS-2 encoding of a string; and basic_string<int> is a UCS-4
encoding
|> of a string (for those rare instances when you need to convert
to/from an
|> external UCS-4 sequence).
|>
|> One complaint I can see with this implementation is that UTF-8 is
really
|> not compatible with 8859-1 (Latin 1) and its friends. It's a fairly
|> heavyweight solution for users who normally use the upper half of
8859-1;
|> the tradeoff is consistency and extensibility for ease of specifying
|> \u0080-\u00FF (which are two-byte characters under UTF-8).

I think it essential to be able to read/write files in 8859-1, and
probably in other 8859 variants as well. On some systems, it will also
be necessary to be able to handle national variants of ISO 646 as well.
And there are several different MBCS encodings in use in China and
Japan.

I've not yet had time to study the details, but I think that this should
be under the control of imbue somehow.

--
James Kanze home: ka...@gabi-soft.fr +33 (0)1 39 55 85 62
 office: ka...@vx.cit.alcatel.fr +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
 -- Conseils en informatique industrielle --


Bradd W. Szonye
Mar 24, 1997

James Kanze <james-alb...@vx.cit.alcatel.fr> wrote in article
<5h5ug2$9...@netlab.cs.rpi.edu>...

> "Bradd W. Szonye" <bra...@concentric.net> writes:
>
> |> There are some tricky points in dealing with extended characters
> |> properly; for example, filebufs always "see" the external
> |> character sequences as multibyte chars.
>
> Why? There are two ways to handle this: use different traits classes
> for different external character codes, or change behavior according to
> the imbued locale.

Why? By definition:

[27.8.1:2]
A File provides byte sequences. So the streambuf (or its derived classes)
treats a file as the external source/sink byte sequence. In a large
character set environment, multibyte character sequences are held in files.
In order to provide the contents of a file as wide character sequences,
wide-oriented filebuf... should convert wide character sequences.

In other words, the implementation of wfilebuf converts wide characters to
multibyte sequences on output and converts multibyte characters to wide
characters on input. In particular, most filebuf operations tied to the
external sequence use the codecvt facet to handle the conversion in their
standard definitions. While you can of course imbue your own facet, it must
be a charT-to-char-and-back facet.

The obvious external encoding for the "C" locale, then, is the "C"
multibyte character encoding for that implementation, which is not usually
the same as the wchar_t encoding.

I suppose you *could* write a codecvt in such a way that the multibyte
encoding was indistinguishable from, say, UCS-2 or UCS-4. However, since
those encodings interact badly with single-byte chars (such as those in
string literals), that's unlikely to be the default implementation.
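In code, the output-side conversion a wfilebuf performs boils down to
something like this (a sketch; widen_out is a hypothetical helper, and
error handling is omitted):

    #include <cstddef>
    #include <cwchar>
    #include <locale>

    // Convert a wide buffer to the external (multibyte) encoding by hand,
    // roughly as wfilebuf does internally. Assumes the output buffer is
    // large enough.
    std::size_t widen_out(const std::locale& loc,
                          const wchar_t* from, const wchar_t* from_end,
                          char* to, char* to_end)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
        const cvt_type& cvt = std::use_facet<cvt_type>(loc);
        std::mbstate_t state = std::mbstate_t();
        const wchar_t* from_next;
        char* to_next;
        cvt.out(state, from, from_end, from_next,   // internal (wide) side
                to, to_end, to_next);               // external (narrow) side
        return to_next - to;                        // external chars produced
    }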



> |> - the basic character set is the first 128 characters of UCS-2
> |>   (the subset commonly known as US-ASCII)
> |> - the multibyte extended encoding is UTF-8
> |> - the extended character set is UCS-2
>
> This sounds close to what is done in Plan 9.

Well, if it is reasonable (to at least some audiences), I'm sure I'm not
the only person to have thought of it.



> |> One complaint I can see with this implementation is that UTF-8 is
> |> really not compatible with 8859-1 (Latin 1) and its friends. It's
> |> a fairly heavyweight solution for users who normally use the upper
> |> half of 8859-1; the tradeoff is consistency and extensibility for
> |> ease of specifying \u0080-\u00FF (which are two-byte characters
> |> under UTF-8).
>
> I think it essential to be able to read/write files in 8859-1, and
> probably in other 8859 variants as well. On some systems, it will also
> be necessary to be able to handle national variants of ISO 646 as well.
> And there are several different MBCS encodings in use in China and
> Japan.

I should have been more careful; the default implementation would not be
compatible with 8859-1 and other mappings, but user-written or
vendor-supplied (additional) facets could provide this capability. You
could, for example, create a facet that treats characters x90-xFF as
characters of their own rather than the prefix to a UTF-8 character
sequence.

In fact, the problem wouldn't even be evident until you tried to read an
8859-1 file with a wfilebuf. Since the base definition of char-to-char
codecvt is a no-op, you wouldn't see any problems working strictly with
chars. Reading the same sequence with a wfilebuf would interpret 8859-1
extended characters as prefix bytes, however, and quickly produce invalid
characters. (Unless, again, you imbued a more appropriate codecvt facet, or
unless I am just plain wrong.)



> I've not yet had time to study the details, but I think that this should
> be under the control of imbue somehow.

It is.


--
Bradd W. Szonye
bra...@concentric.net


James Kanze
Mar 24, 1997

kus...@i.b.m.net (Gary Kushner) writes:

|> On 19 Mar 1997 02:25:16 -0500, James Kanze
|> <james-alb...@vx.cit.alcatel.fr> wrote:
|>
|> >4. Conclusion: you cannot safely pass a char to tolower, since some of
|> >the possible values will result in undefined behavior. You must first
|> >cast it to unsigned char.
|>
|> Well, thanks for the info. The current project is just for in-house,
|> so I'm not going to worry about it now, but it is good to know.

I can buy that. "Just for in-house", like a certain OS written by
Thompson and Ritchie back in the late sixties/early seventies:-). And of
course, you work in an environment where it is materially impossible for
a user to accidentally connect the wrong file type to your input; your
OS distinguishes between text files and non-text files and will not
allow you to write a character with the top bit set into a text file.

If you have a large amount of working code, there may be more urgent
errors to fix. But I would certainly start getting into the habit of
doing it right in new code.

--
James Kanze home: ka...@gabi-soft.fr +33 (0)1 39 55 85 62
office: ka...@vx.cit.alcatel.fr +33 (0)1 69 63 14 54
GABI Software, Sarl., 22 rue Jacques-Lemercier, F-78000 Versailles France
-- Conseils en informatique industrielle --


James Kanze
Mar 25, 1997

"Bradd W. Szonye" <bra...@concentric.net> writes:

|> James Kanze <james-alb...@vx.cit.alcatel.fr> wrote in article
|> <5h5ug2$9...@netlab.cs.rpi.edu>...
|> > "Bradd W. Szonye" <bra...@concentric.net> writes:
|> >
|> > |> There are some tricky points in dealing with extended characters
|> > |> properly; for example, filebufs always "see" the external
|> > |> character sequences as multibyte chars.
|> >
|> > Why? There are two ways to handle this: use different traits classes
|> > for different external character codes, or change behavior according to
|> > the imbued locale.
|>
|> Why? By definition:
|>
|> [27.8.1:2]
|> A File provides byte sequences. So the streambuf (or its derived classes)
|> treats a file as the external source/sink byte sequence. In a large
|> character set environment, multibyte character sequences are held in files.
|> In order to provide the contents of a file as wide character sequences,
|> wide-oriented filebuf... should convert wide character sequences.
|>
|> In other words, the implementation of wfilebuf converts wide characters to
|> multibyte sequences on output and converts multibyte characters to wide
|> characters on input. In particular, most filebuf operations tied to the
|> external sequence use the codecvt facet to handle the conversion in their
|> standard definitions. While you can of course imbue your own facet, it must
|> be a charT-to-char-and-back facet.

Again: I've not studied this too well, so take my comments with a grain
of salt, but...

This seems exactly what is wanted. My external files will generally
have different encodings; a file picked up off a Czech web site would
probably not use 8859-1, and a file from Japan almost certainly not. I
even have to code translate between my HP at work and my Sun at home.

Note that the above even supports writing Unicode files: all of the
wchar_t map to "multi-byte" characters (with two bytes per character).
In this case, you would probably want to have two different locales,
according to whether you want to read/write the files in big-endian or
in little-endian order. (I seem to recall reading somewhere that all
Unicode files are supposed to start with the byte-order mark 0xfeff, so
that the reader can determine byte order from the first two bytes: a
reader that sees 0xfffe knows the bytes are swapped. I'm not sure if
this could be handled automatically, however.)
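(Detecting the mark by hand would be easy enough -- a sketch:)

    #include <stdio.h>

    // Peek at the first two bytes of a UCS-2 file: returns 1 for big-endian,
    // 0 for little-endian, -1 if no byte-order mark is present (sketch).
    int byte_order(FILE* f)
    {
        int b0 = getc(f), b1 = getc(f);
        if (b0 == 0xFE && b1 == 0xFF) return 1;   // U+FEFF read in order
        if (b0 == 0xFF && b1 == 0xFE) return 0;   // U+FEFF read byte-swapped
        return -1;
    }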

|> The obvious external encoding for the "C" locale, then, is the "C"
|> multibyte character encoding for that implementation, which is not usually
|> the same as the wchar_t encoding.

The obvious encoding for the "C" locale is the one in which most of the
files physically present on the machine are in. I do not believe that
there is any requirement that the encoding in the "C" locale be the same
for machines sold in Japan as it is here in Europe.

In practice, I don't think it too important. Very few programs should
use "C" locale, basically, only things like compilers. Most programs
pick up the locale from the environment.

The problem with the C++ scheme is that most programs pick up the locale
once, globally, and not on a per file basis. Certainly, the most common
OS's have no way of associating a locale with an individual file. Once
I've picked up a file from Prague on my local machine, the only
indication that it is not encoded in 8859-1 is in my head.

|> I suppose you *could* write a codecvt in such a way that the multibyte
|> encoding was indistinguishable from, say, UCS-2 or UCS-4. However, since
|> those encodings interact badly with single-byte chars (such as those in
|> string literals), that's unlikely to be the default implementation.

In practice, I think that anyone having to use anything more than 8859-1
should be using wchar_t internally, and wide string literals.
Presumably, the actual encoding, multi-byte or no, used in narrow string
literals will depend on a compiler switch. For practical reasons, any
implementation sold in Western Europe had better support 8859-1 (or
something similar like HP's roman8).

Note that string literals are rather exceptional anyway; in my own code,
they occur almost exclusively as arguments to "gettext". In this
context, they only contain US ASCII, so there is no problem. Some work
will be necessary around gettext itself, since it should (IMHO) return a
wchar_t string.

|> > |> - the basic character set is the first 128 characters of UCS-2
|> > |>   (the subset commonly known as US-ASCII)
|> > |> - the multibyte extended encoding is UTF-8
|> > |> - the extended character set is UCS-2
|> >
|> > This sounds close to what is done in Plan 9.
|>
|> Well, if it is reasonable (to at least some audiences), I'm sure I'm not
|> the only person to have thought of it.

It's very reasonable, at least for a totally new system. As it is, I
have too many files in 8859-1 already for me to junk them (and
portability outside of Western Europe has never been a consideration to
date). For the immediate future, it would seem a safe guess that Sun
Sparcs sold in Europe will continue to use 8859-1 as the usual (default)
code set, for example.

I would hope that the implementation will provide the necessary facets.
This is part of their job, after all. (The implementations I use
currently provide the necessary locales, and are, in fact, quite
sophisticated about how they do it.) The real problem that I see is how
to go about deciding which facet to imbue; none of the systems I use can
even tell me whether a file is text or binary, much less what encoding
is used for the text. For window-based systems, this probably means an
extra set of radio buttons when specifying the filename, for command
line systems, additional command line options.
