
is there a problem in the standard regarding user-defined codecvt facets


ys

Feb 27, 2002, 4:51:38 PM
I needed to write a code conversion facet recently, and I learned a
little about code conversion facets in the process. In particular,
this was a Unicode code conversion facet, and I now also know where
publicly available ones can be found.

The writing process consisted of two phases: 1) writing it by the
books, and 2) changing everything from the "books" way so that it
would work. My platform in this case is Visual C++ 6.0 SP5, w/ the
standard (Dinkumware) STL and also STLPort (part of a process to make
my code able to make use of STLPort's debug facilities).

The books way. I originally used what appears to be part of the
Roguewave manual, but a similar description is given in a much more
authoritative place: Bjarne Stroustrup's "Appendix D: Locales",
available at http://www.research.att.com/~bs/3rd_loc0.html, pages 925 -
928. It explains that to write a code conversion facet one creates a
State class. In addition, though it is not spelled out there as far as
I can see, one needs to define a traits type with which to instantiate
the stream, since char_traits is what supplies the appropriate
"state_type" typedef. This is all fine, until one realizes that the
standard (I don't have the standard, so I'm going by the draft
provided at codeproject.com) defines several facets (num_get, num_put,
time_get, time_get_byname, time_put, time_put_byname, money_get,
money_put) in terms of input and output stream buffer iterators, which
are in turn templated on char_traits. So if one redefines the state
type, one has to redefine the char_traits type, which in turn means
re-instantiating num_get and num_put and everything else you might
need, and adding them to the locale. This is substantially more
complex than the simple example given by Stroustrup in the
above-mentioned article:

locale ulocale (locale(), new Cvt_to_upper);
cin.imbue (ulocale);
char ch;
while (cin >> ch) cout << ch;
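
For reference, a facet in the spirit of Cvt_to_upper might look roughly
like the sketch below. This is not Stroustrup's code: it assumes a
C++11 compiler, the class body is a guess at the obvious
implementation, and the helper to_upper_via_facet is a hypothetical
name added only to exercise the facet directly through codecvt::in
instead of through a stream.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Sketch of a stateless char-to-char conversion facet. It reuses the
// implementation-defined std::mbstate_t but never reads or writes it.
class Cvt_to_upper : public std::codecvt<char, char, std::mbstate_t> {
public:
    explicit Cvt_to_upper(std::size_t refs = 0)
        : std::codecvt<char, char, std::mbstate_t>(refs) {}

protected:
    // External (file) chars -> internal chars, upper-casing one-to-one.
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end,
                 const char*& from_next,
                 char* to, char* to_end, char*& to_next) const override {
        while (from != from_end && to != to_end) {
            *to++ = static_cast<char>(
                std::toupper(static_cast<unsigned char>(*from++)));
        }
        from_next = from;
        to_next = to;
        return from == from_end ? ok : partial;
    }

    // Tell callers that a conversion really has to be performed.
    bool do_always_noconv() const noexcept override { return false; }
};

// Hypothetical helper: installs the facet in a locale and calls it directly.
std::string to_upper_via_facet(const std::string& s) {
    std::locale loc(std::locale::classic(), new Cvt_to_upper);
    const std::codecvt<char, char, std::mbstate_t>& cvt =
        std::use_facet<std::codecvt<char, char, std::mbstate_t> >(loc);

    std::mbstate_t state = std::mbstate_t();
    std::string out(s.size(), '\0');
    const char* from_next = nullptr;
    char* to_next = nullptr;
    std::codecvt_base::result r =
        cvt.in(state, s.data(), s.data() + s.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    assert(r == std::codecvt_base::ok);
    (void)r;
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Because the class derives from codecvt<char, char, mbstate_t>, it
replaces the locale's default char conversion facet and no other facet
has to be touched.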

Understandably, Cvt_to_upper here is a simple conversion to upper
case, something that needs no state, and hence can use the
implementation-defined mbstate_t, requiring no additional facets.
However, why should there be a dependency between this facet and other
facets? Why would creating a code conversion facet (say, for
encryption purposes, which needs a very complex user-defined state
type) require you to add to the locale a lot of other relatively
unrelated facets?
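
To make the dependency concrete, here is a minimal sketch assuming
C++11. The names CipherState, CipherTraits, CipherIter and
demonstrate_missing_facet are all hypothetical, and a real traits
class would probably also need a matching pos_type. The point is only
that changing state_type changes the traits type, which changes the
stream buffer iterator type, which in turn names a num_get
instantiation that no standard locale contains.

```cpp
#include <cassert>
#include <iterator>
#include <locale>
#include <string>
#include <type_traits>

// Hypothetical user-defined conversion state, e.g. for a cipher.
struct CipherState {
    unsigned char key[16];
    unsigned position;
};

// Hypothetical traits class: char_traits<char> with a different state_type.
// (Simplified: pos_type is left as inherited.)
struct CipherTraits : std::char_traits<char> {
    typedef CipherState state_type;
};

typedef std::istreambuf_iterator<char> DefaultIter;
typedef std::istreambuf_iterator<char, CipherTraits> CipherIter;

// A different traits type yields a different iterator type...
static_assert(!std::is_same<DefaultIter, CipherIter>::value,
              "the iterator type depends on the traits type");
// ...and therefore a different num_get specialization.
static_assert(!std::is_same<std::num_get<char>,
                            std::num_get<char, CipherIter> >::value,
              "num_get depends on the iterator type");

// The classic locale holds only the default instantiation, so a stream
// built on CipherTraits would not find the facet its extractors ask for.
void demonstrate_missing_facet() {
    const std::locale& loc = std::locale::classic();
    assert(std::has_facet<std::num_get<char> >(loc));
    assert(!(std::has_facet<std::num_get<char, CipherIter> >(loc)));
}
```

A stream instantiated on CipherTraits looks up the second
specialization, so it would have to be created and added to the locale
by hand, along with its num_put counterpart and any other facets
needed.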

Anyway, for my simple code this was too much. So I went back and
worked with mbstate_t, which, according to the standard, is defined in
terms of an implementation-defined set of encoding rules (21.1.4.1.5),
which seems to mean that the type itself is implementation-defined.
Luckily for me, it turns out to be an int in both cases (STLPort,
Dinkumware's STL), so that wasn't much of a problem. I also
know now that better people than me have used it for precisely
the same purpose (a Unicode conversion facet): P. J. Plauger in
http://www.baosys.com/work/cpp/html/17.04/plauger/plauger.htm , and
some submissions to boost.

On the face of it, this seems to me to be a problem with the definition
of the STL. Facets are not supposed to be dependent on each other.
They're intended to work as interchangeable orthogonal building
blocks, so I shouldn't have to depend on an implementation-defined
type to create a code conversion facet. Someone should be able to
imbue a code conversion facet from one library module and a
num_get facet from another library module, and they should work
independently with no dependence of one module on another. It seems
to me, not having studied this problem greatly, that the basic problem
lies in mbstate_t being part of char_traits. It shouldn't be. It
should be part of a conversion_traits or stream_traits, maybe,
although it might not be so simple.

I have seen people who wrote about this, noting that because of the
use of mbstate_t, the code won't be portable (so why use STL if the
code isn't portable). But I've not found any mention of it in the
working group issues list or anywhere that might indicate that this is
a problem to be addressed.

Maybe this is all just part of a big misunderstanding on my part, and
I'd appreciate being enlightened in this issue either way.


P.J. Plauger

Feb 28, 2002, 7:48:00 AM
"ys" <ysa...@yahoo.com> wrote in message news:23be9fc5.02022...@posting.google.com...

> I needed to write a code conversion facet recently. I learned a
> little about code conversion facets in the process. In particular,
> this was a unicode code conversion facet, and now, I also know where
> there are such publicly available ones.
>
> The writing process consisted of two phases: 1) writing it by the
> books, and 2) changing everything from the "books" way so that it
> would work. My platform in this case is Visual C++ 6.0 SP5, w/ the
> standard (Dinkumware) STL and also STLPort (part of a process to make
> my code able to make use of STLPort's debug facilities).

You'll be pleased to learn that we're including comparable debugging
facilities in an upcoming release.

> The books way. I originally used what appears to be part of the
> Roguewave manual, but a similar description is given in a much more
> authoritative place - Bjarne Stroustrup's "Appendix D: Locales"
> available at http://www.research.att.com/~bs/3rd_loc0.html pages 925 -
> 928. Particularly, it describes that to write a code conversion facet
> one creates a State class. Additionally, not explicated there from
> what I can see off-hand, one needs to define a traits type, to
> instantiate the stream by, since the char_traits defines the
> appropriate "state_type" typedef. This is all fine, until one
> realizes that the standard (I don't have the standard, but I'm basing
> myself on the standard draft provided at codeproject.com) defines
> several facets (num_get, num_put, time_get, time_get_byname, time_put,
> time_put_byname, money_get, money_put) in terms of input and output
> buffer iterators, which are in turn templated by char_traits. So if
> one redefines the state type, one has to redefine the char_traits
> type, which means in turn to re-instantiate num_get and num_put and
> everything else you might need, and add them to the locale.

Not necessarily. But you're right about what you have to do to smuggle
a state class through to a codecvt facet.

> This is
> substantially more complex, than the simple example statement given by
> Stroustrup in the above mentioned article:
>
> locale ulocale (locale(), new Cvt_to_upper);
> cin.imbue (ulocale);
> char ch;
> while (cin >> ch) cout << ch;

Yep. Writing real-life working facets is MUCH harder than most textbooks
imply.

> Understandably, Cvt_to_upper here is a simple conversion to upper
> case, something that needs no state, and hence can use the
> implementation-defined mbstate_t, allowing no additional facets.
> However, why should there be a dependency between this facet and other
> facets? Why would creating a code conversion facet (say, for
> encryption purposes, which need a very complex user-defined state
> type), require you to add to the locale a lot of other relatively
> unrelated facets?

Because the design was proposed before it was ever tried, and it was
standardized before it was widely implemented.

> Anyway, for my simple code this was too much. So I went back and
> worked with mbstate_t, which, according to the standard is defined in
> terms of an implementation-defined set of encoding rules (21.1.4.1.5)
> which seems to mean that it itself is implementation-defined. Luckily
> for me, this turns out to be an int in both cases (STLPort,
> Dinkumware's STL), so that it wasn't that much of a problem. I also
> know now that better people than me have made use of it for precisely
> the same purpose (a unicode conversion facet): P. J. Plauger in
> http://www.baosys.com/work/cpp/html/17.04/plauger/plauger.htm , and
> some submissions to boost.

I've gotten even smarmier in my old age. Wherever possible, I treat the
existing mbstate_t as a simple sequence of bytes. A well placed assert
ensures that I've got enough bytes on a given implementation. With
tricks like these, I've managed to write a number of codecvt facets
that seem to work properly with a variety of library implementations.
But then, I'm an untrained professional...
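
The byte-sequence trick described above could be sketched along the
following lines. This is an illustration under stated assumptions, not
code from any shipping library: ConvState, load_state and store_state
are hypothetical names, the field layout is invented for a UTF-16
style decoder, and static_assert assumes C++11 (a 2002 compiler would
have to rely on the runtime assert alone).

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>
#include <type_traits>

// Hypothetical conversion state, kept small enough to fit in a 4-byte
// mbstate_t. All-zero bytes represent the initial state.
struct ConvState {
    unsigned short pending_high_surrogate; // saved first half of a pair
    unsigned char have_pending;            // nonzero if the field above is valid
    unsigned char bom_seen;                // nonzero once a byte-order mark was read
};

// Compile-time check that the state fits on this implementation.
static_assert(sizeof(ConvState) <= sizeof(std::mbstate_t),
              "mbstate_t is too small to carry ConvState");
static_assert(std::is_trivially_copyable<ConvState>::value,
              "ConvState must be copyable with memcpy");

// Read the private state out of the raw bytes of an mbstate_t.
ConvState load_state(const std::mbstate_t& raw) {
    // Runtime form of the same size check, as a pre-C++11 compiler would need.
    assert(sizeof(ConvState) <= sizeof(std::mbstate_t));
    ConvState s;
    std::memcpy(&s, &raw, sizeof s);
    return s;
}

// Write the private state back into the raw bytes of an mbstate_t.
void store_state(std::mbstate_t& raw, const ConvState& s) {
    std::memcpy(&raw, &s, sizeof s);
}
```

A codecvt<wchar_t, char, mbstate_t> override would call load_state on
entry to do_in or do_out and store_state before returning, so the
stream and every other facet keep seeing the ordinary mbstate_t type.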

> On the face of it it seems to me to be a problem with the definition
> of STL. Facets are not supposed to be dependent on each other.
> They're intended to work as interchangeable orthogonal building
> blocks, so I shouldn't have to depend on an implementation defined
> type to create a code conversion facet. Someone should be able to
> imbue a code conversion facet from one library module and to imbue a
> num_get facet from another library module, and they should work
> independently with no dependence of one module on another. It seems
> to me, not having studied this problem greatly, that the basic problem
> lies in mbstate_t being part of char_traits. It shouldn't be. It
> should be part of a conversion_traits or stream_traits, maybe,
> although it might not be so simple.

Uh huh. It's not quite that simple, but you're pretty close. Some
drafts of the C++ Standard even looked a bit like what you suggest.
And some people (who shall remain nameless) suggested several
changes to the facets/traits/state machinery in this direction,
FWIW.

> I have seen people who wrote about this, noting that because of the
> use of mbstate_t, the code won't be portable (so why use STL if the
> code isn't portable). But I've not found any mention of it in the
> working group issues list or anywhere that might indicate that this is
> a problem to be addressed.

It's hard to know where to start. Some of us are content to produce
useful code that works more or less within the (fuzzy) confines of
the existing C++ Standard.

> Maybe this is all just part of a big misunderstanding on my part, and
> I'd appreciate being enlightened in this issue either way.

Gee, I wish you were half as dumb as you fear you might be. But you're
not. Nice summary.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
