
cstring to cbytearray conversion


mfc

Aug 31, 2010, 2:37:31 AM
Hi,

I've some problems converting a CString to a CByteArray. The
CByteArray contains the complete ASCII string (Unicode), but after each
character there's a 0x00 byte...

I don't know where the 0x00 comes from and how I can prevent this
phenomenon.

CByteArray dataArray;
CString test = _T("Hallo-Welt");
dataArray.SetSize(test.GetLength());
memcpy(dataArray.GetData(),test, test.GetLength());

//in another function -> therefore dataArray = data
CByteArray * data = (CByteArray *)wParam;
CAsyncSocket::Send(data->GetData(), data->GetSize(), 0);

After that I transmit this dataArray over the Ethernet connection. Using
Wireshark I get the dataArray [0x48 0x00 0x61 0x00 0x6c 0x00 0x6c
0x00 0x6F 0x00]

best regards
Hans

Uwe Kotyczka

Aug 31, 2010, 3:10:19 AM

If you have UNICODE defined, this is just the way it works.
In UNICODE each letter takes 2 bytes. CString::GetLength
gives you the number of letters, not the number of bytes.
MSDN does not state that clearly (only for MBCS).
Look into the source code of CString to see how memory
allocation is done.

HTH

Goran Pusic

Aug 31, 2010, 4:13:46 AM
On Aug 31, 8:37 am, mfc <mfcp...@googlemail.com> wrote:
> Hi,
>
> I`ve some problems converting a CString to a CByteArray. The
> CByteArray contain the complete ascii-string (unicode) but after each
> character there`s a 0x00 byte....

You need to know exactly what is in your bytes. Since you are on
Windows, and given the rest of your explanation, chances are that your
text is Unicode text encoded in UTF-16, little-endian (a.k.a.
UTF-16LE). Look it up on Wikipedia or wherever.

For example, if your string is some plain English, say "ABC", you'll
have

41 00 42 00 43 00. (41, 42, 43 hex are the Unicode code points for
'A', 'B', 'C'). UTF-16LE encoding of these three Unicode letters gives
you the sequence.

Now... Speaking of UTF-16LE encoding... Plain English text will always
have a letter in one byte, then 0 in another. But as soon as you have
something else, those zeroes will disappear. Further, most
text anywhere will have one character every two bytes, but! But...
There are languages where you'll need 4 bytes for one character. So if
you're counting letters in UTF-16 encoded text, you'll need to detect
that. I think that there are system functions for that, and then
there's a library called ICU that handles Unicode text.

All in all, you need to know a thing or two about Unicode and UTF-16.
Forget ASCII, there's no such thing in the 21st century. Even what you
seem to call ASCII isn't that; it actually is something else.

Goran.

David Webber

Aug 31, 2010, 7:40:53 AM

"mfc" <mfc...@googlemail.com> wrote in message
news:74d0c6de-04f5-48ec...@v41g2000yqv.googlegroups.com...

> I`ve some problems converting a CString to a CByteArray. The
> CByteArray contain the complete ascii-string (unicode) but after each
> character there`s a 0x00 byte....
>
> I don`t know where the 0x00 will come from and how I can prevent this
> phenomenon.

In Unicode (UTF-16 as adopted by Microsoft) standard characters are 2 bytes.
ASCII characters are in the range where the most significant byte is zero.
So if you have a string of ASCII characters represented in Unicode, every
other byte will be zero. (And Intel chips are little-endian, so the least
significant (non-zero) byte in each pair comes first.) Put in a musical
double sharp sign and you'll find that both its bytes are non-zero!

You can prevent the phenomenon by abandoning Unicode, but that's a bad idea.

Dave
--
David Webber
Mozart Music Software
http://www.mozart.co.uk
For discussion and support see
http://www.mozart.co.uk/mozartists/mailinglist.htm

Goran Pusic

Aug 31, 2010, 7:53:25 AM
On Aug 31, 1:40 pm, "David Webber" <d...@musical-dot-demon-dot-co.uk>
wrote:

> In Unicode (UTF-16 as adopted by Microsoft) standard characters are 2 bytes.

I think you need to define the term "standard characters" here ;-).

Goran.

mfc

Aug 31, 2010, 9:05:00 AM
> You can prevent the phenomenon by abandoning Unicode, but that's a bad idea.

Is there no other possibility?

Goran Pusic

Aug 31, 2010, 9:41:07 AM
On Aug 31, 3:05 pm, mfc <mfcp...@googlemail.com> wrote:
> >You can prevent the phenomenon by abandoning Unicode, but that's a bad idea.
>
> there`s no other possibility?

Yes, but they are worse options ;-).

In the code snippet you have, what encoding do you expect on the
receiving end of the socket? My guess is, something you think is
ASCII?

But I will presume that both the sending and receiving machines are
Windows ones, that they're set to an English-speaking part of the
world, and that the receiving program is not a Unicode one. If all these
conditions are met, chances are that the actual encoding of your text, on
the receiving side, should be Windows-1252
(http://en.wikipedia.org/wiki/Windows-1252). If so, you can simply change your snippet to:

CByteArray dataArray;
CStringA test = "Hallo-Welt"; // CHANGE HERE: CStringA instead of CString, no _T()
dataArray.SetSize(test.GetLength());
memcpy(dataArray.GetData(),test, test.GetLength());

//in another function -> therefore dataArray = data
CByteArray * data = (CByteArray *)wParam;
CAsyncSocket::Send(data->GetData(), data->GetSize(), 0);

That is, you can use a locale-specific, predefined encoding on both
sides and be just fine. But of course, a better idea is to specify the actual
encoding of the data in the data itself. This is commonly done in e.g. XML
or HTML. And you should specify the encoding even if you use Unicode,
because if you ever naively send your UTF-16LE text to e.g. a Linux
machine, it might try to read it as UTF-8, and will fail. Or, if you
send it to a big-endian machine that knows UTF-16, it, too, might
fail, as UTF-16LE != UTF-16BE.

It's __all__ about bytes that travel over the wire. You must know
their meaning and encoding first. All else will be easy, just ask
people (e.g. here).

Goran.

Cholo Lennon

Aug 31, 2010, 9:57:16 AM
On 31/08/2010 10:05, mfc wrote:
>> You can prevent the phenomenon by abandoning Unicode, but that's a bad idea.
> there`s no other possibility?

How about transmitting the data in XML format?

--
Cholo Lennon
Bs.As.
ARG

David Webber

Aug 31, 2010, 12:27:42 PM
"Goran Pusic" <gor...@cse-semaphore.com> wrote in message
news:846efa94-34ff-4ec7...@m1g2000yqo.googlegroups.com...

> On Aug 31, 3:05 pm, mfc <mfcp...@googlemail.com> wrote:
>> >You can prevent the phenomenon by abandoning Unicode, but that's a bad
>> >idea.
>>
>> there`s no other possibility?
>
> Yes, but they are worse options ;-).

:-)

> In the code snippet you have, what encoding do you expect on the
> receiving end of the socket? My guess is, something you think is
> ASCII?
>
> But I will presume that booth sending and receiving machines are a
> Windows one, and that it's set to an English-speaking part of the

> world,...

Wales is a largely English-speaking part of the world. But its alphabet
contains w with a ^ on. So if you want to mention Owain Glyndwr, you have
to misspell him (as I have done here) in 1252 or any other code page.

Unicode is a seriously good idea, and my advice is to live with the zero
bytes.

David Webber

Aug 31, 2010, 12:30:52 PM

"Cholo Lennon" <cholo...@hotmail.com> wrote in message
news:i5j1ns$dha$1...@speranza.aioe.org...


> On 31/08/2010 10:05, mfc wrote:
>>> You can prevent the phenomenon by abandoning Unicode, but that's a bad
>>> idea.
>> there`s no other possibility?
>
> How about transmitting the data in XML format?

But with characters like &thorn; (or however else you want to encode it)
you'll start to long for the days when you had only a small number of easily
distinguishable zero bytes :-)

David Webber

Aug 31, 2010, 12:22:25 PM

"Goran Pusic" <gor...@cse-semaphore.com> wrote in message

news:f66bf149-05d3-4719...@e14g2000yqe.googlegroups.com...

Not really :-) If the OP is surprised by the occurrence of zero bytes,
getting into the realms of thingy-whatsit (ah 'surrogate' - that's the word)
pairs in UTF-16 will confuse more than it will help :-)

mfc

Aug 31, 2010, 12:44:42 PM
The real application is a webserver; I've installed the webserver
(Windows Embedded) which transmits the HTTP data as Unicode; of course I'm
using the encoding header in the HTTP protocol (UTF-8) to signal the
user how the data will be translated - but I don't know how it is
possible to display the webpage at the computer of a customer (where I
do not know the settings - Unicode or not).

Because there's no information flag for the first part of my HTTP
response to signal the user - hey, it's Unicode!

mfc

Aug 31, 2010, 2:27:14 PM

Using multibyte code everything is working and the page will be
displayed correctly:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
<title>Der Titel</title>
</head>
<body>
...
</body>
</html>

// add content type: responseHdr.AddContentType(_T("text/html; charset=UTF-8"));

and the whole message which will be transmitted is:

HTTP/1.0 200 OK
Content-Length: 263
Content-Type:text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
<title>Der Titel</title>
</head>
<body>
...
</body>
</html>


Any hints?

Goran

Aug 31, 2010, 2:37:10 PM
On Aug 31, 6:44 pm, mfc <mfcp...@googlemail.com> wrote:
> The real application is a webserver; I`ve installed the webserver
> (windows embedded) which tx the http data as unicode; of course I`m
> using the encoding header in the http protocol (UTF-8)

OK, that's better as context. To send UTF-8 data, you simply use
WideCharToMultiByte with CP_UTF8, that's all. That gives you a
sequence of char's that you can use to form your HTTP stream (I am not
too sure if some character that is not encoded as one octet in UTF-8
can be present in the HTTP header, and I don't know what you should do
if that is possible; but as for the content, you're golden).

Goran.

P.S. If you want to do HTTP, you should really consider using
something more high-level, not sockets (WinInet?).

mfc

Aug 31, 2010, 2:50:24 PM

Thanks for your answer: I also tested UTF-16LE instead of UTF-8, but
I get the same result :-(; Firefox doesn't understand the content type
(therefore I always get a prompt window which indicates the default
encoding (application/octet-stream) and asks if I want to store this file
(which includes the whole text which I transmitted from the webserver))...

The webpage should be shown in several languages (German, English and
Cyrillic (therefore Unicode))

mfc

Aug 31, 2010, 3:00:20 PM
> Cyrillic (therefore unicode))

Using multibyte code together with the charset utf-16le works - I also
get the Cyrillic characters... but I don't know why Unicode won't
work. I receive the whole HTTP code (webpage) using Wireshark. There
are no bytes missing.

David Webber

Aug 31, 2010, 5:37:23 PM

"mfc" <mfc...@googlemail.com> wrote in message

news:8b3ad7f8-f741-42e9...@q22g2000yqm.googlegroups.com...


> On 31 Aug., 18:44, mfc <mfcp...@googlemail.com> wrote:
>> The real application is a webserver; I`ve installed the webserver
>> (windows embedded) which tx the http data as unicode; of course I`m
>> using the encoding header in the http protocol (UTF-8) to signal the
>> user how the data will be translated - but I don`t know how it is
>> possible to display the webpage at the computer of a customer (where I
>> do not know the settings - unicode or not).
>>
>> Because there`s no information flag for the first part of my http-
>> response to signal the user - hey it`s unicode!
>
> using multibyte code - everything is working and the page will be
> displayed correct

>....


> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >

>...

Careful with definitions. That defines the character set to be Unicode
UTF-8; I believe what Microsoft calls 'multibyte', was a pre-unicode
system.

I too have web pages specified as Unicode UTF-8 but I don't know how well
browsers respond to it if the characters start getting 'fancy' (see below).
Presumably reasonably well these days?

Just as a guideline:

Unicode assigns every character ever invented in any language, and others
like mathematical and music symbols to a unique 'code point'.

But there are various ways that the code points can be represented:

UTF-32 has 4 bytes per character. Every code point is represented as a
single 4-byte integer. The human race has apparently not yet invented more
than 2^32 characters, so they're safe for the moment. I'm not sure if
anyone uses UTF-32 though :-(

UTF-16 has 2-byte units. Most of the characters you'll ever need occupy a
single 2-byte unit. But the human race *has* invented more than 2^16
characters and so some characters need two 2-byte units - they are called
'surrogate pairs'. [In that case, the first 2-byte thing is one of a set
which says "hey man I'm not a character by myself, you have to take me
together with the next one".] Microsoft NT/2000/XP/Vista/7 operating
systems use UTF-16 little-endian as the native character type. (There are
systems with names like UCS-2 which are like Unicode UTF-16 but without the
surrogate pairs IIRC.) But if you utter the words 'Unicode' and
'Microsoft' in the same breath, you probably mean Unicode with the UTF-16
encoding.

UTF-8 has 1-byte units. The first 127 or so (I'm vague on details)
represent the first code points (which happen to be the old ASCII set),
but in general a single code point needs up to 4 of the 1-byte units to
represent it, and the higher bytes signal that one of these is coming.
You can surely find the specs for these with Google. UTF-8, as you've
noticed, is an option for web pages.

But my point is that your original question now looks as if it was not how
to get rid of zero bytes from a CString, but how to convert a UTF-16 Unicode
string to UTF-8. Now I've never done it, but I'm sure lots of people have,
and if you start Googling with the question framed thus, you may have
success.

Ah, but reading ahead, I see Goran has given the answer. Nevertheless I hope
the above why's and wherefore's help.

Goran Pusic

Sep 1, 2010, 2:50:14 AM

I am not sure about the content-type HTTP header, but one thing I am
sure about: if your actual HTML is UTF-8 and you do specify UTF-8 as in
'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">',
your HTML is correct and any browser will display it. I would
guess that the same goes for UTF-16LE, but I never tried.

As for actual Content-type header, not my cup of tea.

What I would do if I were you would be to save the octets I have to disk,
open them with a binary editor (the VS one is fine), and that would show
whether all is fine.

Goran.

Goran Pusic

Sep 1, 2010, 2:54:02 AM

Well, show actual octets (HTTP header and HTML content). Perhaps
that'll explain something.

BTW, I am not sure what you mean by "multibyte code together with the
charset utf-16le" - that doesn't make sense to me.
"Multibyte" (character set) is Microsoft name for non-unicode text,
with specific encodings for each language. I don't know how you can
mix multibyte and any Unicode encoding, that should work only by
chance. You should probably use better terms for things you have in
code ;-).

Goran.

mfc

Sep 1, 2010, 6:01:08 AM

Using Unicode in the project settings of VS I get UTF-16LE as the
character set. I'm using Wireshark to display the binary data I get,
and it's what you would expect if you are using UTF-16LE. But the web
browser (IE, or Firefox) won't display the page... that's my problem
and at the moment it seems that there's no solution?!?

mfc

Sep 1, 2010, 6:03:57 AM
> and at the moment it seems that there's no solution?!?

0x48 0x00 0x54 0x00 0x54 0x00 0x50 0x00 0x2F 0x00 and so on - is what
I get using unicode....

David Wilkinson

Sep 1, 2010, 7:13:07 AM
mfc wrote:
> 0x48 0x00 0x54 0x00 0x54 0x00 0x50 0x00 0x2F 0x00 and so on - is what
> I get using unicode....

I think I am losing track of the problem here, but it seems to me that if your
CString is "Unicode", which means utf-16 in Microsoft-land, and you want your
content to be utf-8, then you should convert it using WideCharToMultiByte with
CP_UTF8 flag.

Any string, in any language, can be converted between utf-16 and utf-8 (in
either direction) without loss, because both utf-16 and utf-8 are full encodings
of the Unicode character set.

--
David Wilkinson

Goran Pusic

Sep 1, 2010, 7:51:51 AM
> 0x48 0x00 0x54 0x00 0x54 0x00 0x50 0x00 0x2F 0x00 and so on - is what
> I get using unicode....

OK, that should be fine. What are you specifying in Content-Type in
your HTTP headers (not in your HTML)? It should be "Content-Type: text/
html; charset=UTF-16LE". My guess (didn't try) is that if you just
use Unicode and CStringW (maps to CString if you compile with
UNICODE), and if your server declares "Content-Type: text/html;
charset=UTF-16LE" in the HTTP header, you should be fine. I checked with a
file on the disk, Firefox understands UTF-16BE/LE (although through the
Byte Order Mark, a.k.a. BOM), so I would be highly surprised if you
couldn't use UTF-16 from a server.

But note that your HTTP headers should be in one-octet-per-character,
that is, if you're creating them from MFC code, use CStringA ;-).

So, summary:

* "plain 7-bit ASCII" for HTTP;
* have Content-Type: text/html; charset=UTF-16LE in HTTP headers
* CStringW for HTML

Drop a line when things start working ;-)

Goran.

mfc

Sep 1, 2010, 7:59:06 AM

OK, I use this function to convert the stream to UTF-8; but now all
Cyrillic characters won't be displayed correctly.

I hope I'm right that I only have to use this function WideCharToMultiByte
for the HTTP header information, and the HTTP body will be in UTF-16LE
to get all the required information?

mfc

Sep 1, 2010, 8:38:32 AM
> Goran.

Thanks for your answer; it's nearly working; but if I add some
Cyrillic characters to the HTML resource, the only option is to
store this HTML file as UTF-8. After that, opening the file with

LPSTR lpcHtml = static_cast<LPSTR>(LockResource(hHeader));

at the beginning there are two strange characters; after that the
doctype will be shown... Either the storage format is wrong or I can't
use LPSTR to open the resource...


Hans

Goran Pusic

Sep 1, 2010, 9:01:13 AM
On Sep 1, 2:38 pm, mfc <mfcp...@googlemail.com> wrote:
> thanks for your answer; it`s nearly working; but if I add some
> Cyrillic characters to the html-resource, there`s only the chance to
> store this html-file as UTF-8. After that opening the file with
>
> LPSTR lpcHtml = static_cast<LPSTR>(LockResource(hHeader));
>
> At the beginning, there are two strange characters;

Two or three? There's a thing called a Byte Order Mark (BOM) that's
used in Windows to store encoding. So editors save said BOM and read
it back to understand the encoding. Notepad does it, and the VS editor does it
as well. So you have EF BB BF (the BOM for UTF-8), FF FE (UTF-16LE) or
FE FF (UTF-16BE). See http://en.wikipedia.org/wiki/Byte_order_mark.

If you do have such bytes in front, you should remove them. You can do
that with the binary editor of VS: click the small arrow and "Open
With" command on the Open button of the Open File dialog in VS, choose
"Binary editor". In it, delete the first 2-3 bytes, then save.

> after that the
> doctype will be shown... Either the storage-format is wrong or I can`t
> use LPSTR to open the resource...

You can. LPSTR (or any other cast, for that matter), has no bearing on
LockResource. LockResource gives you just a pointer, and it is
entirely up to you to interpret it. If you (by using a cast) say that
pointer points to a string, string it is.


Goran.

David Wilkinson

Sep 1, 2010, 11:43:49 AM
mfc wrote

> ok I use this function to convert the stream to utf-8; but now all
> Cyrillic characters won`t be displayed correct.

They should be.

> I hope I`m right, I only have to use this function WideCharToMultiByte
> for the http header information and the http body will be in utf-16le
> to get all the required information?

Very few web sites use UTF-16, but UTF-8 is very common. The great things about
UTF-8 are that

(a) Characters 0-127 are the same as ASCII

(b) It can represent any content, in any language.

If your page is not displaying correctly, you must be doing the conversion wrong.

--
David Wilkinson

mfc

Sep 1, 2010, 3:12:58 PM
On 1 Sep., 17:43, David Wilkinson <no-re...@effisols.com> wrote:
> mfc wrote
>
> > ok I use this function to convert the stream to utf-8; but now all
> > Cyrillic characters won`t be displayed correct.
>
> They should be.
>

I found my mistake; UTF-16 includes a BOM at the beginning of the
file (0xFF 0xFE) -> therefore I have to delete this BOM when I try to
transmit the data to the webserver

CStringW myPage;
myPage.GetHTML(IDR_HTML_TEST);

myPage.RemoveAt(0);
myPage.RemoveAt(0);

I'll test tomorrow whether it will work


David Wilkinson

Sep 1, 2010, 3:27:33 PM
mfc wrote:
> I got my mistake; utf-16 will include a BOM at the beginning of the
> file 0xFF 0xFE -> therefore I have to delete this BOM when I try to
> transmit the data to the webserver
>
>
> CStringW myPage;
> myPage.GetHTML(IDR_HTML_TEST);
>
> myPage.RemoveAt(0);
> myPage.RemoveAt(0);
>
> I`ll test it tomorrow, if it will work

But this does not have to do with UTF-8 I think; it has to do with encoding the
HTML as UTF-16.

While UTF-16 is the native encoding of recent Windows OS's
(NT-2000-XP-Vista-Win7), the encoding most used on the internet is UTF-8. If you
mostly use ASCII text, UTF-16 pages will be twice as large as UTF-8 ones.

--
David Wilkinson

Goran Pusic

Sep 2, 2010, 3:40:59 AM
FWIW, I agree with David, for anything "internet", UTF-8 is a better
choice.

It's not hard: save/manipulate all your stuff in Unicode encoding
supported by the system (here: UTF-16LE). Convert to UTF-8 for that
final send over the wire. So just use CString with UNICODE (that is,
use CStringW), and, before you put it on the wire, use
WideCharToMultiByte(CP_UTF8, ...).

(
To do that easily, you'll probably decide that you need helpers like
these:

// Use WideCharToMultiByte(CP_UTF8, ...) here.
CStringA UTF8(const CStringW& src);
CStringA UTF8(LPCWSTR src);
CStringA UTF8(LPCWSTR src, int length);
)

Goran.

mfc

Sep 3, 2010, 8:47:00 AM
OK, it's working now; but one problem still remains...

If I add some Cyrillic characters into the HTML resource, after
loading this resource (UTF-16LE format) and using
WideCharToMultiByte(CP_UTF8...), all Cyrillic characters will be
shown wrong!

I think I know the reason: the character length of the
UTF-16LE string and the UTF-8 string is not the same ->
utf8string.GetLength() != utf16lestring.GetLength(). It seems that
WideCharToMultiByte() won't work correctly if there are really 16-bit
characters...

By the way, I stored the HTML resource as UTF-8 (without signature) in
VS. Maybe someone has a hint.

best regards
Hans

David Wilkinson

Sep 3, 2010, 9:10:23 AM

Of course GetLength() returns different numbers for the UTF-16 (CStringW) and
UTF-8 (CStringA) versions of the string. This will always be the case if there
are non-ASCII characters in the string.

But you must be doing something wrong. WideCharToMultiByte() and
MultiByteToWideChar() can convert any string between UTF-16 and UTF-8 without loss.

--
David Wilkinson

mfc

Sep 3, 2010, 9:22:47 AM
CString tester2 = _T("ю");
UINT size = tester2.GetLength();

CStringA utf8;
int len = WideCharToMultiByte(CP_UTF8, 0, tester2.GetString(), -1, NULL, 0, 0, 0);

char *ptr = utf8.GetBuffer(len-1);
if (ptr) WideCharToMultiByte(CP_UTF8, 0, tester2.GetString(), -1, ptr, len, 0, 0);
utf8.ReleaseBuffer();

tester2 contains one Cyrillic character (0x044E) -> but the result in
utf8 is not 0x04 0x4E...

mfc

Sep 3, 2010, 10:29:44 AM
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Beschreibung der Seite</title>
</head>
<body>
Б
</body>
</html>

In this small page is one Cyrillic character with the number 0x0411 ->
if I use WideCharToMultiByte() I get the following bytes for this
line: 0xc390e28098 0x0d0a; the last two bytes are \r\n. But why did I
get 5 bytes for one character? I thought I would only get one...

mfc

Sep 3, 2010, 10:56:04 AM
If I use

CStringW tester2 = _T("Б");
CStringA utf8 = UTF16toUTF8(tester2);

and convert this to UTF-8, everything is working; but getting the data
from the HTML resource (which is stored as UTF-8),
WideCharToMultiByte() doesn't work... So maybe here's the error:

LPSTR lpcHtml = static_cast<LPSTR>(LockResource(hHeader));

rString = CStringW(lpcHtml);
CStringA utf8 = UTF16toUTF8(rString);

I hope someone has an idea

best regards
Hans

David Wilkinson

Sep 3, 2010, 4:05:22 PM
mfc wrote:
> CString tester2 = _T("ю");

The bytes of a UTF-8 encoding are not just the two bytes of the UTF-16 encoding.
UTF-8 character encodings can have 1, 2, 3 or 4 bytes. Only ASCII characters
require just one byte.

See this page for the character 0x044E:

<http://www.fileformat.info/info/unicode/char/44e/index.htm>

It says the UTF-8 for this character is 0xD1 0x8E.

--
David Wilkinson

Goran

Sep 4, 2010, 1:53:27 AM
On Sep 3, 3:22 pm, mfc <mfcp...@googlemail.com> wrote:
> tester2 includes one Cyrillic character (0x044E) -> but the result in
> utf8 is not 0x04 0x4E...

That is normal. Look up how the actual conversion from a Unicode code
point to its UTF-8 representation is done. It's rarely two bytes, and
is AFAIK never the same two bytes that define said code point.

Goran.

Goran

Sep 4, 2010, 2:09:48 AM
On Sep 3, 4:56 pm, mfc <mfcp...@googlemail.com> wrote:
> ...but getting the data

> from the html-resource (which is stored as UTF-8)
> WideCharToMultiByte() doesn`t work... So maybe there`s the error

That's definitely wrong. The input for this is lpWideCharStr, which means
a UTF-16 string. If your HTML is saved as UTF-8, you should e.g. load it
into a CStringA, not touch it at all, and send it like that. If you need
to manipulate your HTML before sending, you can do that either with
CStringA (or LPSTR) or CStringW (LPWSTR), but you have to take care of
knowing exactly what encoding you have in there (see below).

> LPSTR lpcHtml = static_cast<LPSTR>(LockResource(hHeader));
> rString = CStringW(lpcHtml);
> CStringA utf8 = UTF16toUTF8(rString);

Yeah, that's not good. Here, if your resource is in UTF-8, the second line
is not doing the correct thing. This is because the constructor you're
invoking uses the "ANSI" code page (MSDN says: "CStringT( LPCSTR
lpsz) ... Constructs a Unicode CStringT from an ANSI string"), and your
actual "code page" (quotes intended) is UTF-8.

The above can work only if your resource was saved as "ANSI" (or, more
precisely, using one of the Windows-supported, non-Unicode MBCS
encodings), and if the actual code page used when the program is running
matches the code page used to save the resource. That's obviously not
possible if you want to mix e.g. Cyrillic and Latin characters - for
that, you definitely need multiple languages and multiple code pages,
and this "method" constrains you to one.

Goran.

P.S. If you actually need to e.g. replace some Cyrillic characters in
UTF-8, you are best off using some Unicode string manipulation library
(ICU springs to mind), or doing it in UTF-16 (CStringW).

Joseph M. Newcomer

Sep 4, 2010, 12:40:47 PM
See below...
On Mon, 30 Aug 2010 23:37:31 -0700 (PDT), mfc <mfc...@googlemail.com> wrote:

>Hi,
>
>I`ve some problems converting a CString to a CByteArray. The
>CByteArray contain the complete ascii-string (unicode) but after each
>character there`s a 0x00 byte....
>
>I don`t know where the 0x00 will come from and how I can prevent this
>phenomenon.
>
>CByteArray dataArray;
>CString test = _T("Hallo-Welt");
>dataArray.SetSize(test.GetLength());
****
This is INCORRECT. The CORRECT form is

dataArray.SetSize(test.GetLength() * sizeof(TCHAR));

Nothing else will work correctly, for all the Unicode reasons that have already been
given.
****
>memcpy(dataArray.GetData(),test, test.GetLength());
****
That would be

memcpy(dataArray.GetData(), test, test.GetLength() * sizeof(TCHAR));
*****
>
>//in another function -> therefore dataArray = data
>CByteArray * data = (CByteArray *)wParam;
****
Note that if you are using PostMessage, you cannot pass the address of dataArray, because
it is a local variable. If you are using SendMessage across threads, you are living
dangerously and should change to using PostMessage
****
>CAsyncSocket::Send(data->GetData(), data->GetSize(), 0);
>
>After that I transmit this dataArray by the ethernet connection. Using
>whireshark I get the dataArray [0x48 0x00 0x61 0x00 0x6c 0x00 0x6c
>0x00 0x6F 0x00]
***
as explained, that is what you are *supposed* to see. Consider using UTF-8 encoding; you
can look in my Asynchronous Socket example code (on my MVP Tips site) to see ways to get
UTF-8 data. Then, on the receiving side, you convert UTF-8 back to Unicode.
joe
****
>
>best regards
>Hans
Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Joseph M. Newcomer

Sep 4, 2010, 11:41:49 PM
See below...

****
Looks reasonable. It says
HTTP/
and is what you would find from the string literal
L"HTTP/"

Why should this be surprising? If you use Unicode, that's what you should expect to see!
joe

Joseph M. Newcomer

Sep 4, 2010, 11:53:27 PM
Note that you are confusing several purposes here
- Rendering of characters in a control
- Transmission of characters across a communication channel

You have to decide what format you want for each.

For rendering of characters in a control, your choices are the "ANSI" representation of
8-bit characters, which are interpreted relative to a "code page": a mapping
between the values 0x00..0xFF and character glyphs. This is generally considered a deeply
losing idea. The only reasonable approach to what you store in controls is Unicode, and
in Windows that means using the UTF-16LE encoding. This encoding encodes the values
0x00..0xFF as the low-order byte of a two-byte sequence; in the LE encoding the
low-order byte is the first byte and the high-order byte is the second byte. So "ABC" is

0x41 0x00 0x42 0x00 0x43 0x00 0x00 0x00

If you make the mistake of using any method that expects 8-bit character strings, it will
truncate the string at the first character, so the trick is to never use any 8-bit
character functions (e.g., str*** functions; and in some cases, str***_s functions) or any
other method that expects 8-bit characters.

To transmit this data, it is common to convert it to a NUL-free representation, such as
UTF-8. UTF-8 is useful only for transmission; it has no meaning internally. You cannot
send it to controls. Upon receipt, you must convert it back to UTF16LE for use in
Windows.

In the case of HTTP protocols, you specify this as part of the header protocol string. The
recipient is expected to honor the specification of the encoding. I've lost track of what
is happening here, and it has been too many years since I last programmed HTTP-based
server-side stuff so I no longer recall all the details. Can you consolidate where you
currently are in this set of experiments?
joe

Joseph M. Newcomer

unread,
Sep 5, 2010, 12:03:41 AM9/5/10
to
See below...

On Fri, 3 Sep 2010 05:47:00 -0700 (PDT), mfc <mfc...@googlemail.com> wrote:

>ok it`s working now; but one problem still remains...
>
>If I add some Cyrillic characters into the html resource. After
>loading this resource (utf-16le format) and using
>WideCharToMultiByte(CP_UTF8...) - all Cyrillic characters will be
>shown wrong!
>
>I think I know the reason, because the character-length of the
>utf-16le string and the utf-8 string is not the same ->
>utf8string.GetLength() != utf16lestring.GetLength(). It seams that
>WideCharToMultiByte() won`t work correct if there are really 16bit
>characters...

****
There is no sane reason to expect that the length of a UTF-8 string is going to be equal
to a UTF-16LE string length. That's one of the reasons that, if you call
WideCharToMultiByte with a NULL, 0 specification for the destination buffer, it returns
the length of the 8-bit buffer you need to allocate, including space for the terminal
NUL character. You then provide a buffer of that length on the next call.
Similarly, MultiByteToWideChar with NULL, 0 tells you how many WCHARs to allocate,
including the terminal NUL, so you allocate a buffer of that length. Any other
expectation is erroneous.
joe
****


>
>By the way I stored the html resource as utf-8 (without signature) in
>vs. Maybe someone has a hint.
>
>best regards
>Hans

Joseph M. Newcomer

unread,
Sep 5, 2010, 12:46:03 AM9/5/10
to
See below...

On Fri, 3 Sep 2010 07:29:44 -0700 (PDT), mfc <mfc...@googlemail.com> wrote:

><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
><html xmlns="http://www.w3.org/1999/xhtml">
><head>
><title>Beschreibung der Seite</title>
></head>
><body>

>?


></body>
></html>
>
>In this small page is one Cyrillic character with the number 0x0411 ->
>if I use WideCharToMultiByte() I will get the following bytes for this
>line 0xc390e28098 0x0d0a; the last two bytes are \r\n. But why did I
>get 5 bytes for one character? I thought I will only get one...

****
You have not said what encoding you are using. Under UTF-8, one 16-bit Unicode character
is encoded in somewhere between 1 and 3 bytes (supplementary characters, which occupy two
16-bit units, encode to 4 bytes). In particular, your Cyrillic character will take 2
bytes. You can use my Locale Explorer to find out exactly what the encoding is (I'd do it
right now, but my server died about ten minutes ago, and the only copy I have is on the
server).

In UTF-8 encoding, you have
0xC3 0x90 0xE2 0x80 0x98 0x0D 0x0A

Looking at this as binary, we have 7 bytes
1100 0011 * 1001 0000 * 1110 0010 * 1000 0000 * 1001 1000 * 0000 1101 * 0000 1010

So the first byte 1100 0011 says we have two bytes of encoding; and the second byte is
1001 0000. The third byte says it is a 3-byte encoding; the second byte of the encoding
is 1000 0000 and the third byte is 1001 1000. The high-order bit of the sixth byte is 0,
meaning it is a direct encoding, as is the seventh byte. So you have four characters
here. One encoded as 2 bytes, one encoded as 3 bytes, one encoded as 1 byte, and one
encoded as 1 byte.

00000000 0xxxxxxx => 0xxxxxxx
00000yyy yyxxxxxx => 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx => 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx => 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

(this is from Page 103 of The Unicode Standard 5.0)

So in UTF-16LE, this would decode to
1100 0011 1001 0000 => 0000 0000 1101 0000 => D0 00 (U+00D0)
1110 0010 1000 0000 1001 1000 => 0010 0000 0001 1000 => 20 18 (U+2018)

This seems odd, because 0x0411 should, according to these rules, encode as 0xD0 0x91.

Are you sure you are really encoding the correct string?
joe

Joseph M. Newcomer

unread,
Sep 5, 2010, 1:03:52 AM9/5/10
to
See below...
On Fri, 3 Sep 2010 06:22:47 -0700 (PDT), mfc <mfc...@googlemail.com> wrote:

>CString tester2 = _T("?");


>UINT size = tester2.GetLength();
>
>CStringA utf8;
>int len = WideCharToMultiByte(CP_UTF8, 0,
> tester2.GetString(),
> -1, NULL, 0, 0, 0);
>
>char *ptr = utf8.GetBuffer(len-1);

****
Why are you creating a buffer smaller than the length you need? You should be doing a
GetBuffer(len)!
****


>if (ptr) WideCharToMultiByte(CP_UTF8, 0, tester2.GetString(), -1, ptr,
>len, 0, 0);

***
This line is nearly unreadable; you should not put anything on the line after the if().

You did a GetBuffer of len-1 and you tell this API that you have a buffer of len! So it
is going to overwrite somebody else's storage...
****


>utf8.ReleaseBuffer();
>
>tester2 includes one Cyrillic character (0x044E) -> but the result in
>utf8 is not 0x04 0x4E...

****
OF COURSE IT ISN'T!!!! Look at the chart above. If you encode 0x044E then this is

0000 0100 0100 1110 => 1101 0001 1000 1110 => D18E

so why in the world would you expect the UTF-8 encoding to be 0x04 0x4E? This would not
be a valid UTF-8 encoding, because the first byte does not have the two high order bits of
the first byte set to 11, indicating a two-byte encoding, and the second byte does not
have the high-order bit set, indicating it is a secondary byte of a multibyte encoding. In
fact, in your expectation, what you have encoded is the input bytes 0x04 0x4E, as the
characters representing Ctrl+D and N. Not sure why you think you should transform a
Cyrillic character to a Ctrl+D followed by N.

It would help if, instead of guessing, you actually looked up what UTF-8 actually is.
Google would probably be a good start.
joe
****

0 new messages