
Q: Convert std::string to std::wstring using std::ctype widen()


Jeffrey Walton

Nov 25, 2006, 7:18:32 AM
Hi All,

I've done a little homework (I've read responses to similar questions
from P.J. Plauger and Dietmar Kuehl), and wanted to verify with the
Group. Over in comp.lang.c++, I'm getting a lot of boot Microsoft
answers (which do not help me). Below is what I am performing
(Stroustrup's Appendix D recommendations won't compile in Microsoft
VC++ 6.0).

My question is in reference to MultiByte Character Sets. Will this code
perform as expected? I understand every problem has a simple and
elegant solution that is wrong.

I generally use US English or Unicode, so I don't encounter a lot of
issues others may see (a multibyte character using std::string). I have
verified it works with a Hello World sample.

Before I get flamed for not using std::codecvt, Stroustrup states
(D.4.6 Character Code Conversion, p 925):
The codecvt facet provides conversion between different character sets
when a character is moved between a stream buffer and external
storage...

Jeff
Jeffrey Walton

std::string s = "Hello World";
std::ctype<wchar_t> ct;
std::wstring ws;

for( std::string::const_iterator it = s.begin();
     it != s.end(); it++ )
{
    ws += ct.widen( *it );
}

// Stroustrup again (Section D.4.5, p. 923):
// A call widen(c) transforms the character c into its
// corresponding Ch value.
// If Ch's character set provides several characters
// corresponding to c, the standard specifies that "the simplest
// reasonable transformation" be used.

// http://www.research.att.com/~bs/3rd_loc.pdf
// by Bjarne himself...
// page 28 of the above reference
// or
// The C++ Programming Language, Special Edition
// Section D.4.2.2, p 895 (Full Manual)
//
// const std::locale& loc = s.getloc();
// wchar_t w = std::use_facet< std::ctype<char> >
//     (loc).widen(c);
// does not compile in Microsoft's VC++ 6.0 environment...
// getloc() is not a member of std::basic_string< ... > ...
//
// wchar_t wc = std::use_facet< std::ctype<wchar_t> >
//     (out.getloc()).widen(*it);
// does not compile in Microsoft's VC++ 6.0 environment...

//
// Dietmar Kuehl code
// does not compile in Microsoft's VC++ 6.0 environment...
//
// std::wstring to_wide_string(std::string const& source) {
//     typedef std::ctype<wchar_t> CT;
//     std::wstring rc;
//     rc.resize(source.size());
//     CT const& ct = std::use_facet<CT>(std::locale());
//     ct.widen(source.data(), source.data() +
//              source.size(), rc.data());
//     return rc;
// }

Bo Persson

Nov 25, 2006, 9:02:42 AM
Jeffrey Walton wrote:
> Hi All,
>
> I've done a little homework (I've read responses to similar from
> P.J. Plauger and Dietmar Kuehl), and wanted to verify with the
> Group. Over in comp.lang.c++, I'm getting a lot of boot Microsoft
> answers (which does not help me). Below is what I am performing
> (Stroustrup's Appendix D recommendations won't compile in Microsoft
> VC++ 6.0).
>
> My question is in reference to MultiByte Character Sets. Will this
> code perform as expected? I understand every problem has a simple
> and elegant solution that is wrong.
>
> I generally use US English or Unicode, so I don't encounter a lot of
> issues others may see (a multibyte character using std::string). I
> have verified it works with a Hello World sample.
>
> Before I get flamed for not using std::codecvt, Stroustrup states
> (D.4.6 Character Code Conversion, p 925):
> The codecvt facet provides conversion between different character
> sets when a character is moved between a stream buffer and external
> storage...
>
> Jeff
> Jeffrey Walton
>
> std::string s = "Hello World";
> std::ctype<wchar_t> ct;

This is using the default version of ctype, not necessarily the one in the
current locale. That's why the other codes have a use_facet<>() to retrieve
the current active version.
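
A minimal sketch of that difference, assuming a standard-conforming library rather than VC6. The facet is looked up in a locale object with std::use_facet instead of being constructed directly. The name widen_each is hypothetical, and the loop still widens one char at a time like the original post, so it is no better for multibyte input.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: widens one char at a time, but takes the ctype
// facet from the supplied locale (by default a copy of the current
// global locale) instead of default-constructing std::ctype<wchar_t>.
std::wstring widen_each(const std::string& s,
                        const std::locale& loc = std::locale())
{
    const std::ctype<wchar_t>& ct =
        std::use_facet< std::ctype<wchar_t> >(loc);

    std::wstring ws;
    ws.reserve(s.size());
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it)
        ws += ct.widen(*it);   // single-char overload: one char in, one wchar_t out
    return ws;
}
```

Passing a specific locale object as the second argument selects that locale's facet; leaving it out uses whatever the global locale is at the time of the call.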

> std::wstring ws;
>
> for( std::string::const_iterator it = s.begin();
> it != s.end(); it++ )
> {
> ws += ct.widen( *it );
> }
>
> // Stroustrup again (Section D.4.5, p. 923):
> // A call widen(c) transforms the character c into its
> // corresponding Ch value.
> // If Ch's character set provides several characters
> // corresponding to c, the standard specifies that "the simplest
> // reasonable transformation" be used.
>
> // http://www.research.att.com/~bs/3rd_loc.pdf
> // by Bjarne himself...
> // page 28 of the above reference
> // or
> // The C++ Programming Language, Special Edition
> // Section D.4.2.2, p 895 (Full Manual)
> //
> // const std::locale& loc = s.getloc();
> // wchar_t w = std::use_facet< std::ctype<char> >
> //     (loc).widen(c);
> // does not compile in Microsoft's VC++ 6.0 environment...
> // getloc() is not a member of std::basic_string< ... > ...

Here s cannot be a string, but a stream. A stream has a locale, a string
does not.

> //
> // wchar_t wc = std::use_facet< std::ctype<wchar_t> >
> // (out.getloc()).widen(*it);
> // does not compile in Microsoft's VC++ 6.0 environment...

VC6 has its own set of limitations, especially with templates. Don't use it
if you don't absolutely have to.

>
> //
> // Dietmar Kuehl code
> // does not compile in Microsoft's VC++ 6.0 environment...
> //
> // std::wstring to_wide_string(std::string const& source) {
> // typedef std::ctype<wchar_t> CT;
> // std::wstring rc;
> // rc.resize(source.size());
> // CT const& ct = std::use_facet<CT>(std::locale());
> // ct.widen(source.data(), source.data() +
> // source.size(), rc.data());
> // return rc;

This gets its ctype from the current default locale. It ought to work.

Except that std::use_facet perhaps doesn't work for VC6, and you might have
to use a macro workaround _USE(loc, facet) instead?


Or upgrade, if you can!


Bo Persson

Igor Tandetnik

Nov 25, 2006, 9:35:26 AM
"Jeffrey Walton" <nolo...@gmail.com> wrote in message
news:1164457112....@f16g2000cwb.googlegroups.com

> My question is in reference to MultiByte Character Sets. Will this
> code perform as expected? I understand every problem has a simple and
> elegant solution that is wrong.
>
> for( std::string::const_iterator it = s.begin();
> it != s.end(); it++ )
> {
> ws += ct.widen( *it );
> }

This overload of widen() has absolutely no chance to handle MBCS
correctly, as it only sees one byte at a time. You need to use the other
overload, widen(char*, char*, CharType*).
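
A short sketch of the range overload, assuming a standard-conforming library (so std::use_facet rather than a VC6 workaround). The helper name widen_range is hypothetical. It converts a whole buffer in one call by passing the begin and end pointers of the source and a pointer to the destination.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: widens a whole buffer with the range overload
// widen(const char* low, const char* high, wchar_t* dest).
std::wstring widen_range(const char* text, std::size_t len)
{
    std::wstring out(len, L'\0');
    if (len == 0)
        return out;                        // nothing to convert

    const std::ctype<wchar_t>& ct =
        std::use_facet< std::ctype<wchar_t> >(std::locale());
    ct.widen(text, text + len, &out[0]);   // writes exactly len wide characters
    return out;
}
```

The destination must already hold at least len elements, because widen() writes through the pointer and does not grow the string.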
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925


Jeffrey Walton

Nov 25, 2006, 5:47:19 PM

Bo Persson wrote:
> Jeffrey Walton wrote:
> > Hi All,
> >
> > I've done a little homework...
> > SNIP...

> >
> > My question is in reference to MultiByte Character Sets. Will this
> > code perform as expected?
> >
> > SNIP...
> >

Hi Bo,

> > std::string s = "Hello World";
> > std::ctype<wchar_t> ct;
>
> This is using the default version of ctype, not necessarily the one in the
> current locale. That's why the other codes have a use_facet<>() to retrieve
> the current active version.

Thank you.

> > SNIP


> > // const std::locale& loc = s.getloc();
> > // wchar_t w = std::use_facet< std::ctype<char> >
> > // (loc).widen(c);
> > // does not compile in Microsoft's VC++ 6.0 environment...
> > // getloc() is not a member of std::basic_string< ... > ...
>
> Here s cannot be a string, but a stream. A stream has a locale, a string
> does not.

Makes sense now that you state it.

>
> > //
> > // wchar_t wc = std::use_facet< std::ctype<wchar_t> >
> > // (out.getloc()).widen(*it);
> > // does not compile in Microsoft's VC++ 6.0 environment...
>
> VC6 has its own set of limitations, especially with templates. Don't use it
> if you don't absolutely have to.

I don't really have a choice here :/

> > SNIP


>
> This gets its ctype from the current default locale. It ought to work.
>
> Except that std::use_facet perhaps doesn't work for VC6, and you might have
> to use a macro workaround _USE(loc, facet) instead?
>

> SNIP...

I'm going to try _USE(loc, facet) immediately. I'll kick myself if that
is all I had to do...

Jeff

Jeffrey Walton

Nov 25, 2006, 5:48:50 PM

Igor Tandetnik wrote:
> "Jeffrey Walton" <nolo...@gmail.com> wrote in message
> news:1164457112....@f16g2000cwb.googlegroups.com
> > My question is in reference to MultiByte Character Sets. Will this
> > code perform as expected? I understand every problem has a simple and
> > elegant solution that is wrong.
> >
> > for( std::string::const_iterator it = s.begin();
> > it != s.end(); it++ )
> > {
> > ws += ct.widen( *it );
> > }
>
> This overload of widen() has absolutely no chance to handle MBCS
> correctly, as it only sees one byte at a time. You need to use the other
> overload, widen(char*, char*, CharType*).
> --
> With best wishes,
> Igor Tandetnik
Thanks Igor

Ulrich Eckhardt

Nov 27, 2006, 2:43:50 AM
Igor Tandetnik wrote:
> "Jeffrey Walton" <nolo...@gmail.com> wrote in message
> news:1164457112....@f16g2000cwb.googlegroups.com
>> My question is in reference to MultiByte Character Sets. Will this
>> code perform as expected? I understand every problem has a simple and
>> elegant solution that is wrong.
>>
>> for( std::string::const_iterator it = s.begin();
>> it != s.end(); it++ )
>> {
>> ws += ct.widen( *it );
>> }
>
> This overload of widen() has absolutely no chance to handle MBCS
> correctly, as it only sees one byte at a time. You need to use the other
> overload, widen(char*, char*, CharType*).

That doesn't matter: all C++ IOStreams assume that one internal character is
also a complete character. In other words, they can't handle multibyte
encodings (like UTF-8) internally anyway (and yes, that also includes UTF-16,
which is what MS maps wchar_t to!).
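
To illustrate the one-char-to-one-wchar_t mapping, here is a sketch assuming a standard-conforming library. The helper widened_length is hypothetical and uses the classic "C" locale. It reports how many source chars the range overload of widen() consumed, which is also the number of wide characters it wrote.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: returns how many wchar_t values ctype::widen
// produces for the given bytes. The range overload consumes the whole
// source range and writes one wchar_t per input char.
std::size_t widened_length(const std::string& bytes)
{
    if (bytes.empty())
        return 0;

    std::wstring out(bytes.size(), L'\0');
    const std::ctype<wchar_t>& ct =
        std::use_facet< std::ctype<wchar_t> >(std::locale::classic());
    const char* first = bytes.data();
    const char* stop  = ct.widen(first, first + bytes.size(), &out[0]);
    return static_cast<std::size_t>(stop - first);
}
```

For the two bytes "\xC3\xA9" (the UTF-8 encoding of a single accented letter) the count is two, so the facet never combines the bytes into one character, whichever overload is used.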

Uli

Tom Widmer [VC++ MVP]

Nov 27, 2006, 5:03:45 AM

This should work:

#include <string>
#include <locale>

std::wstring to_wide_string(std::string const& source) {

typedef std::ctype<wchar_t> CT;

std::wstring rc('\0', source.size());
CT const& ct = std::_USE(std::locale(), CT);
ct.widen(source.data(), source.data() +
source.size(), &rc[0]);
return rc;
}

It relies on the non-portable assumption that std::wstring is
contiguous, but that's certainly true in VC6.

Tom

Jeffrey Walton

Dec 11, 2006, 6:13:11 PM
Tom Widmer [VC++ MVP] wrote:
> Jeffrey Walton wrote:
> > Hi All,
> >
> > I've done a little homework (I've read responses to similar from P.J.
> > Plauger and Dietmar Kuehl), and wanted to verify with the Group. Over
> > in comp.lang.c++, I'm getting a lot of boot Microsoft answers (which
> > does not help me). Below is what I am performing (Stroustrup's Appendix
> > D recommendations won't compile in Microsoft VC++ 6.0).
> >
> > SNIP

>
> This should work:
>
> #include <string>
> #include <locale>
>
> std::wstring to_wide_string(std::string const& source) {
> typedef std::ctype<wchar_t> CT;
> std::wstring rc('\0', source.size());
> CT const& ct = std::_USE(std::locale(), CT);
> ct.widen(source.data(), source.data() +
> source.size(), &rc[0]);
> return rc;
> }
>
> It relies on the non-portable assumption that std::wstring is
> contiguous, but that's certainly true in VC6.
>
> Tom

Hi Tom,

It took a little massaging, but I finally got the following to work
(access violations in the original code: I think it was 'std::wstring
rc('\0', source.size())'). Thank you very much. Sorry about the late
response.

Jeff

// Courtesy of Tom Widmer (VC++ MVP)
std::wstring Widen(std::string const& source) {

typedef std::ctype<wchar_t> CT;

std::wstring rc;
rc.resize( source.length() + 1 );

CT const& ct = std::_USE(std::locale(), CT);

ct.widen(source.begin(), source.end(), &rc[0]);

return rc;
}

Jeffrey Walton

Dec 11, 2006, 6:23:29 PM

Jeffrey Walton wrote:
>
> SNIP

>
> // Courtesy of Tom Widmer (VC++ MVP)
> std::wstring Widen(std::string const& source) {
>
> typedef std::ctype<wchar_t> CT;
>
> std::wstring rc;
> rc.resize( source.length() + 1 );
>
> CT const& ct = std::_USE(std::locale(), CT);
> ct.widen(source.begin(), source.end(), &rc[0]);
>
> return rc;
> }

I think I was able to dress it up even a bit prettier (got rid of
'&rc[0]'):

// Courtesy of Tom Widmer (VC++ MVP)

#include <locale>
std::wstring Widen(std::string const& narrow) {

typedef std::ctype<wchar_t> CT;

std::wstring wide;
wide.resize( narrow.length() + 1 );

CT const& ct = std::_USE(std::locale(), CT);

ct.widen( narrow.begin(), narrow.end(), wide.begin() );

return wide;
}

Jeffrey Walton

Dec 11, 2006, 8:16:07 PM
Whoops... I broke std::string find (narrow was a substring for which I
was searching in s). DO NOT use narrow.length() + 1.

// Courtesy of Tom Widmer (VC++ MVP)
#include <locale>
std::wstring Widen( std::string const& narrow ) {

typedef std::ctype<wchar_t> CT;

std::wstring wide;
wide.resize( narrow.length() );

CT const& ct = std::_USE(std::locale(), CT);
ct.widen( narrow.begin(), narrow.end(), wide.begin() );

return wide;
}

Tom Widmer [VC++ MVP]

Dec 12, 2006, 4:58:49 AM
Jeffrey Walton wrote:
> Whoops... I broke std::string find (narrow was a substring for which I
> was searching in s). DO NOT use narrow.length() + 1.
>
> // Courtesy of Tom Widmer (VC++ MVP)
> #include <locale>
> std::wstring Widen( std::string const& narrow ) {
>
> typedef std::ctype<wchar_t> CT;
>
> std::wstring wide;
> wide.resize( narrow.length() );

Those two lines should compress to (I missed the L before and had the
args in the wrong order):
std::wstring wide(narrow.length(), L'\0');
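
A small sketch of why the argument order matters. Both function names are hypothetical. The second function spells out what the swapped call amounts to if the (count, character) constructor is the one selected. The thread does not say which overload VC6 actually chose, so treat that part as an assumption.

```cpp
#include <cstddef>
#include <string>

// The (count, character) constructor: n copies of L'\0'.
std::wstring make_buffer(std::size_t n)
{
    return std::wstring(n, L'\0');
}

// The swapped arguments, made explicit: '\0' becomes a count of 0 and
// the intended length becomes the fill character.
std::wstring make_buffer_swapped(std::size_t n)
{
    return std::wstring(static_cast<std::size_t>('\0'),
                        static_cast<wchar_t>(n));
}
```

Under that assumption the swapped form yields an empty string, so a later write through &rc[0] has no storage to land in.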

>
> CT const& ct = std::_USE(std::locale(), CT);
> ct.widen( narrow.begin(), narrow.end(), wide.begin() );

For portability, that should be:

ct.widen(&narrow[0], &narrow[0] + narrow.size(), &wide[0]);

The reason is that string iterators are not normally pointers (they are
in VC6, but not in any later VC). You should be careful never to treat
iterators as pointers, or you'll be making yourself major headaches if
you ever need to get the codebase working with a more recent version of VC.

So, to reiterate:
Assuming that std::(w)string is contiguous is a non-standard but
portable assumption (I don't know of a library where it isn't true).
Assuming that std::string::iterator is char* is a non-standard,
non-portable assumption (it doesn't hold reliably for any up-to-date
compiler).

Tom

Jeffrey Walton

Dec 12, 2006, 9:00:08 PM
Hi Tom,

Thanks again. I went portable (though ugly). Below is the Narrow in
case others want to try it. I chose to use underscore as the default
character ('_').

Jeff

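// Courtesy of Tom Widmer (VC++ MVP)
#include <locale>
#include <string>

std::string StringNarrow( const std::wstring& wide ) {

    typedef std::ctype<wchar_t> CT;

    // Nothing to convert; also avoids taking &narrow[0] on an empty string.
    if( wide.empty() ) return std::string();

    std::string narrow;
    narrow.resize( wide.length() );

    // _USE is the VC6 workaround for std::use_facet
    CT const& ct = std::_USE(std::locale(), CT);

    // Non-Portable: relies on string iterators being raw pointers
    // ct.narrow( wide.begin(), wide.end(), '_', narrow.begin() );

    // Portable: '_' replaces any wide character with no narrow equivalent
    ct.narrow( &wide[0], &wide[0] + wide.size(), '_', &narrow[0] );

    return narrow;
}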
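
For readers not on VC6, here is a sketch of the same idea, assuming a standard-conforming library where std::use_facet is available. The name narrow_with_default and the fallback parameter are hypothetical, and the classic "C" locale is used so the result does not depend on the global locale.

```cpp
#include <locale>
#include <string>

// Hypothetical standard-library counterpart of StringNarrow: any wide
// character the facet cannot map to a char becomes `fallback`.
std::string narrow_with_default(const std::wstring& wide, char fallback = '_')
{
    std::string out(wide.size(), '\0');
    if (wide.empty())
        return out;                        // avoid &out[0] on an empty string

    const std::ctype<wchar_t>& ct =
        std::use_facet< std::ctype<wchar_t> >(std::locale::classic());
    ct.narrow(&wide[0], &wide[0] + wide.size(), fallback, &out[0]);
    return out;
}
```

Plain ASCII text passes through unchanged, while a character such as the euro sign has no single-char form in the "C" locale and is replaced by the fallback.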

Tom Widmer [VC++ MVP] wrote:
> Jeffrey Walton wrote:
> > SNIP
>
