
Conversion problem between UTF-8 and Unicode characters.


Sunray

Jul 2, 2009, 12:34:01 PM

Do I misunderstand something?

This is in C#

When I read UTF-8 text from an Access notes field using ADO.NET, I get the
UTF-8 bytes held as individual characters in a UTF-16 string. In this example
I'm going to use the Hebrew letter het, 'ח'. This is U+05D7 in Unicode and
D7 97 in UTF-8.

Now I can easily convert the U+05D7 character held in a C# string to its
corresponding UTF-8 bytes.

UnicodeEncoding unicode = new UnicodeEncoding();
UTF8Encoding utf8 = new UTF8Encoding();

string het = "ח";
byte[] UnicodeHet = unicode.GetBytes(het);
byte[] UTF8Bytes = Encoding.Convert(unicode,utf8,UnicodeHet);

UTF8Bytes is then written to the database.
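
As a cross-check of the byte values quoted above, here is a minimal sketch
that encodes the same character directly with Encoding.UTF8 instead of going
through Encoding.Convert. The class and method names are hypothetical and
only there for illustration.

```csharp
using System;
using System.Text;

class HetToUtf8
{
    // Hypothetical helper: encodes a string straight to UTF-8 bytes.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        string het = "\u05D7"; // Hebrew letter het
        byte[] utf8Bytes = ToUtf8(het);

        // BitConverter.ToString renders the bytes as hyphen-separated hex.
        Console.WriteLine(BitConverter.ToString(utf8Bytes)); // prints D7-97
    }
}
```

This should produce the same two bytes as the Encoding.Convert route above,
since both end up encoding U+05D7 as UTF-8.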

When I read this back from the database I get two characters, held in a
UTF-16 C# string, that represent the two UTF-8 bytes. I can convert these
back to the het character using the following code:

UnicodeEncoding unicode = new UnicodeEncoding();
UTF8Encoding utf8 = new UTF8Encoding();
Encoding local = Encoding.GetEncoding(1252);

string utf8het = "׳—"; //Normally read from the database but hardcoded here
byte[] utf8hetbytes = local.GetBytes(utf8het);
byte[] utf8result = Encoding.Convert(utf8,unicode,utf8hetbytes);
string result = unicode.GetString(utf8result);

If the code page for the machine is set to 1252 this works correctly. For
example, if the value stored in the database was the Hebrew letter het, 'ח',
the byte sequence contains the UTF-8 bytes D7 97, which are correctly decoded
to U+05D7.

Problem: If I subsequently change the code page of the machine to Hebrew,

byte[] utf8hetbytes = local.GetBytes(utf8het);

starts returning 3F 97. 3F is '?', which generally means that the character
could not be translated into the target code page.
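
One way to see how a 3F can arise is to encode a character that code page
1252 has no byte for. The sketch below is an illustration only, not a
confirmed diagnosis of this database setup. It assumes that the hardcoded
literal above is U+05F3 followed by U+2014, and that it runs on modern .NET,
where code pages 1252 and 1255 have to be registered through
CodePagesEncodingProvider (on the .NET Framework that line can be dropped).
The class and method names are hypothetical.

```csharp
using System;
using System.Text;

class FallbackDemo
{
    // Hypothetical helper: encodes text with the given Windows code page
    // and returns the bytes as hyphen-separated hex.
    public static string Hex(int codePage, string text)
    {
        // Assumption: modern .NET, where legacy code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(codePage);
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        // Bytes D7 97 as they look when decoded with code page 1252.
        string decodedAs1252 = "\u00D7\u2014";
        // Bytes D7 97 as they look when decoded with code page 1255 (Hebrew).
        string decodedAs1255 = "\u05F3\u2014";

        Console.WriteLine(Hex(1252, decodedAs1252)); // D7-97
        Console.WriteLine(Hex(1252, decodedAs1255)); // 3F-97
        Console.WriteLine(Hex(1255, decodedAs1255)); // D7-97
    }
}
```

If the two characters read from the database change when the machine code
page changes, the 1252 encoder would be given a different string each time,
which would be consistent with the 3F 97 result. Whether that is what happens
here is an assumption that has not been verified.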

Why?

If I switch to using the default code page, it always works. Unfortunately
it appears the rest of the code (poor) requires 1252. Am I wrong in assuming
that if I ask for the 1252 encoding it should not be affected by the code
page of the machine? It appears that I am faced with a fairly major rework
because of this.
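
To make the "default code page always works" observation concrete, here is a
rough sketch of a round trip in which the code page is passed in explicitly
and unmappable characters raise an exception instead of silently becoming 3F.
The Recover helper is hypothetical, the code page numbers are supplied by the
caller rather than read from the machine, and the CodePagesEncodingProvider
line assumes modern .NET.

```csharp
using System;
using System.Text;

class MojibakeRoundTrip
{
    // Hypothetical helper: re-encodes the garbled string with the code page
    // that is assumed to have produced it, then decodes the bytes as UTF-8.
    public static string Recover(string garbled, int codePage)
    {
        // Assumption: modern .NET, where legacy code pages must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make a code page mismatch fail loudly.
        Encoding strict = Encoding.GetEncoding(
            codePage,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        byte[] raw = strict.GetBytes(garbled);
        return Encoding.UTF8.GetString(raw);
    }

    static void Main()
    {
        Console.WriteLine(Recover("\u00D7\u2014", 1252) == "\u05D7"); // True
        Console.WriteLine(Recover("\u05F3\u2014", 1255) == "\u05D7"); // True

        try
        {
            Recover("\u05F3\u2014", 1252);
        }
        catch (EncoderFallbackException)
        {
            Console.WriteLine("1252 cannot encode U+05F3");
        }
    }
}
```

The strict fallback is a design choice for the sketch: it turns a silent
3F into an exception, so a mismatch between the string and the chosen code
page shows up at the point where it happens.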

Is there another way to get the two UTF-8 bytes held in a C# string into a
byte array without going through a code page?

Thanks in advance

Alex

Nag

Sep 1, 2009, 10:49:15 PM
Hi,
Is there any sample application or starter kit for building an
internationalized .NET web application?

-Nagendra

"Sunray" <Sun...@discussions.microsoft.com> wrote in message
news:00C8F20F-42E5-4031...@microsoft.com...
