UTF-8 char[] sequence to UCS 2 wchar_t sequence conversion

Bill Holt

Jan 7, 2004, 3:59:40 AM

Hi,

I'm trying to convert a UTF-8 character sequence into wchar_t using
the MultiByteToWideChar API call with CP_UTF8 as the code page argument.
The resulting wchar_t is not correct. Are there any special
requirements for using this API?

Also, is there a simple function that can do this, preferably using
std::codecvt? Examples are appreciated!

Regards,
Bill

Tim Robinson

Jan 7, 2004, 2:46:55 PM

"Bill Holt" <mail...@21cn.com> wrote in message
news:74e0d43f.04010...@posting.google.com...

> I'm trying to convert a UTF-8 character sequence into wchar_t using
> the MultiByteToWideChar API call with CP_UTF8 as the code page argument.
> The resulting wchar_t is not correct. Are there any special
> requirements for using this API?

Yes, providing some more details or some source code when you post to a
newsgroup :).

> Also, is there a simple function that can do this, preferably using
> std::codecvt? Examples are appreciated!

I put some code together (attached) to convert between UTF-8 and TCHAR, to
be used by the standard locale functions. It's pretty evil.

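Here's an example of its use in loading a UTF-8-formatted text file:

// std::locale::empty() and std::_ADDFAC are extensions of the VC++ 6 library;
// with a conforming library, std::locale(loc, the_facet) is the usual equivalent.
std::basic_ifstream<TCHAR> str;
CharSetFacet *the_facet = new CharSetFacet(CP_UTF8);
std::locale loc(std::locale::empty());
str.imbue(std::_ADDFAC(loc, the_facet)); // imbue before opening the file
// T2CA is an ATL conversion macro (needs USES_CONVERSION in scope).
str.open(T2CA(m_filename.c_str()));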

--
Tim Robinson (MVP, Windows SDK)
http://www.themobius.co.uk/

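// Needs <windows.h>, <tchar.h>, <locale> and <cstring>.
class CharSetFacet : public std::codecvt<TCHAR, char, mbstate_t>
{
protected:
    UINT m_cp;
    CPINFO m_info;
    typedef TCHAR from_type;  // character type seen by the stream
    typedef char to_type;     // byte type stored in the file

public:
    // No id member here: the facet inherits codecvt's id so that it
    // replaces the stream's default conversion facet.
    explicit CharSetFacet(UINT cp) : m_cp(cp)
    {
        if (!GetCPInfo(cp, &m_info))
            m_info.MaxCharSize = 1;  // unknown code page: assume one byte per character
    }

protected:
#ifndef UNICODE
    // ANSI build: TCHAR is char, so both directions are a bounded byte copy.
    static result copy(const char *first1, const char *last1, const char *& next1,
        char *first2, char *last2, char *& next2)
    {
        size_t in = last1 - first1, out = last2 - first2;
        size_t n = in < out ? in : out;
        memcpy(first2, first1, n);
        next1 = first1 + n;
        next2 = first2 + n;
        return n < in ? partial : ok;
    }
#endif

    // Bytes -> characters. Progress is always reported through next1 and next2.
    virtual result do_in(state_type&,
        const to_type *first1, const to_type *last1, const to_type *& next1,
        from_type *first2, from_type *last2, from_type *& next2) const
    {
        next1 = first1;
        next2 = first2;
        if (first1 == last1)
            return ok;
        if (first2 == last2)
            return partial;
#ifdef UNICODE
        int n = MultiByteToWideChar(m_cp, 0, first1, (int)(last1 - first1),
            first2, (int)(last2 - first2));
        if (n == 0)
        {
            switch (GetLastError())
            {
            case ERROR_INSUFFICIENT_BUFFER:
            case 0:
                return partial;  // nothing converted; more input or output space is needed

            default:
                *next2++ = (unsigned char)*next1++;  // pass one undecodable byte through
                return ok;
            }
        }
        // Limitation: a sequence split across two calls is not reassembled.
        next1 = last1;
        next2 = first2 + n;
        return ok;
#else
        return copy(first1, last1, next1, first2, last2, next2);
#endif
    }

    // Characters -> bytes. The signature must match the base class exactly
    // (const member, reference next pointers) or it will not override.
    virtual result do_out(state_type&,
        const from_type *first1, const from_type *last1, const from_type *& next1,
        to_type *first2, to_type *last2, to_type *& next2) const
    {
        next1 = first1;
        next2 = first2;
        if (first1 == last1)
            return ok;
        if (first2 == last2)
            return partial;
#ifdef UNICODE
        int n = WideCharToMultiByte(m_cp, 0, first1, (int)(last1 - first1),
            first2, (int)(last2 - first2), NULL, NULL);
        if (n == 0)
        {
            switch (GetLastError())
            {
            case ERROR_INSUFFICIENT_BUFFER:
            case 0:
                return partial;

            default:
                return error;
            }
        }
        next1 = last1;
        next2 = first2 + n;
        return ok;
#else
        return copy(first1, last1, next1, first2, last2, next2);
#endif
    }

    virtual bool do_always_noconv() const throw()
    {
#ifdef UNICODE
        return false;  // char -> wchar_t always needs a conversion
#else
        return true;   // bytes are copied unchanged
#endif
    }

    virtual int do_max_length() const throw()
    {
        return m_info.MaxCharSize;
    }

    virtual int do_encoding() const throw()
    {
        return 0;  // variable-width encoding
    }

    // Standard signature: counts bytes of [first1, last1) to consume for at
    // most len2 characters. (The VC++ 6 library declares this differently.)
    virtual int do_length(state_type&, const to_type *first1,
        const to_type *last1, size_t len2) const
    {
        // Approximation: each character takes at least one byte, so this
        // many bytes never yield more than len2 characters.
        size_t avail = last1 - first1;
        return (int)(avail < len2 ? avail : len2);
    }
};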

Bill Holt

Jan 7, 2004, 10:20:17 PM

Thanks Tim, but umm, I'm trying to run a few tests on your code. In my
situation, I need to use std::basic_istringstream because I'm dealing
with memory buffers. I also #define UNICODE before including
windows.h.

The problem is that
std::basic_istringstream<TCHAR> str
is actually defined as
std::basic_istringstream<wchar_t> str
when UNICODE is defined, which makes it impossible for str to .read()
non-wchar_t characters.

How can I feed char* characters into the stream?

Regards,
Bill

"Tim Robinson" <tim.at.gaat.f...@invalid.com> wrote in message news:<bthnnh$7ctcn$1...@ID-103400.news.uni-berlin.de>...

> Yes, providing some more details or some source code when you post to a
> newsgroup :).

Tim Robinson

Jan 8, 2004, 2:48:06 PM

The idea behind this code is that the str object reads bytes from disk and
converts them automatically, by means of the the_facet object, to wchar_t.
The only thing that knows about the byte representation of characters is
CharSetFacet; anything that calls str.read() can expect UCS-2.
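
For a memory buffer, the same conversion step can be run without a stream. std::basic_istringstream never consults a codecvt facet (only file buffers do), so the sketch below calls the facet's in() member directly. It is a portable illustration rather than the attached code: it assumes a C++11 compiler and uses the standard std::codecvt<char16_t, char, std::mbstate_t> specialization (UTF-8 to UTF-16, deprecated since C++20) in place of CharSetFacet. Utf8Facet and decode_with_facet are hypothetical names.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: codecvt's destructor is protected, so a derived
// type with a public destructor is needed to create one directly.
struct Utf8Facet : std::codecvt<char16_t, char, std::mbstate_t>
{
    ~Utf8Facet() {}
};

// Converts a whole UTF-8 buffer with a single call to in().
std::u16string decode_with_facet(const std::string& bytes)
{
    if (bytes.empty())
        return std::u16string();

    Utf8Facet facet;
    std::mbstate_t state = std::mbstate_t();

    // UTF-16 never needs more code units than the UTF-8 input has bytes.
    std::u16string out(bytes.size(), u'\0');

    const char* from = bytes.data();
    const char* from_next = from;
    char16_t* to = &out[0];
    char16_t* to_next = to;

    std::codecvt_base::result r = facet.in(state,
        from, from + bytes.size(), from_next,
        to, to + out.size(), to_next);

    // Anything other than ok means the input was not fully converted.
    if (r != std::codecvt_base::ok)
        throw std::runtime_error("invalid or truncated UTF-8");

    out.resize(to_next - to);  // to_next marks the end of the converted output
    return out;
}
```

The from_next and to_next arguments are how a facet reports its progress, which is why a do_in override has to set them before returning.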

If you really did want to read or write chars through str, you'd have to
convert them to UCS-2, then call methods on str, which would convert them to
UTF-8 using CharSetFacet.
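
The manual route (convert the chars yourself, then hand the result to the wide stream) can be sketched without any Windows calls. This is only an illustration under stated assumptions: it uses char16_t and std::u16string as a portable stand-in for the 16-bit wchar_t on Windows, it emits surrogate pairs for code points above U+FFFF (so strictly UTF-16 rather than UCS-2), and utf8_to_utf16 is a hypothetical name, not something from the attached code.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes a UTF-8 byte buffer into UTF-16 code units, throwing on bad input.
std::u16string utf8_to_utf16(const char* data, std::size_t size)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < size)
    {
        unsigned char lead = static_cast<unsigned char>(data[i]);
        unsigned long cp;
        std::size_t extra;  // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > size - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, encoded surrogates and out-of-range values.
        static const unsigned long min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid UTF-8 code point");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows the result could be copied into a std::wstring and fed to a std::basic_istringstream<wchar_t>; that last step is left out here because it depends on wchar_t being 16 bits wide.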

--
Tim Robinson (MVP, Windows SDK)
http://www.themobius.co.uk/

"Bill Holt" <mail...@21cn.com> wrote in message
news:74e0d43f.0401...@posting.google.com...
