DB2 UTF-8 ODBC double conversion

Mihajlo Cvetanović

unread,

Nov 20, 2009, 11:02:23 AM11/20/09

to

I inherited a non-Unicode MFC project that connects to DB2 database
through ODBC. The application uses UTF-8 strings for database
transactions because the the strings are multilingual. In DB2 Control
Center I see those strings as "raw", but when calling SELECT
LENGTH(<some_column>) FROM <some_table> I see that each UTF-8 byte
occupies two bytes in database, as if there is extra conversion to UTF-8
taking place between ODBC and database. It seems that DB2 doesn't
understand that I'm working with UTF-8 strings and thinks that those
strings coming from ODBC are in system code page.

Is there a way to disable this extra conversion to UTF-8, and make DB2
aware that strings from ODBC are already UTF-8?

I realize this is more the question for some DB2 newsgroup, but it
involves MFC as well.

Tom Serface

unread,

Nov 20, 2009, 12:55:21 PM11/20/09

to

Is the data in the DB/2 database Unicode? If you're getting two bytes for
each then maybe there is a conversion being done some where. I don't think
this is necessarily a bad thing since it might save you some conversion
headaches if you just convert your MFC app to Unicode and use that all the
time.

My guess is the conversion is happening somewhere before it gets to the
database. Are you sure there is nothing like MultiByteToWideChar() being
called in your code somewhere?

Of course, doing ANSI and counting on codepage would be the worst way to
handle strings in a database that could be accessed from any language
install of the OS. In my experience going to Unicode hasn't really affected
the performance much.

Tom

"Mihajlo Cvetanović" <mcvetanovic@gmail> wrote in message
news:e9Xxarfa...@TK2MSFTNGP06.phx.gbl...

Giovanni Dicanio

unread,

Nov 20, 2009, 2:31:09 PM11/20/09

to

"Tom Serface" <t...@camaswood.com> ha scritto nel messaggio
news:OeyOvqga...@TK2MSFTNGP02.phx.gbl...

> In my experience going to Unicode hasn't really affected the performance
> much.

Actually, speaking of performance, my guess is that ANSI might be "slower"
(of course, it could not be noticeable by the end user), because modern
version of Windows like XP or 7 implement the ANSI version (DoSomethingA) of
the APIs doing conversion from ANSI to Unicode, and calling the Unicode
versions (DoSomethingW).
So ANSI has a conversion overhead.

G

Giovanni Dicanio

unread,

Nov 20, 2009, 2:34:05 PM11/20/09

to

"Tom Serface" <t...@camaswood.com> ha scritto nel messaggio
news:OeyOvqga...@TK2MSFTNGP02.phx.gbl...

> Is the data in the DB/2 database Unicode? If you're getting two bytes for

> each then maybe there is a conversion being done some where.

I read here:

http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0004821.htm

that:

"DB2(R) Universal Database (DB2 UDB) supports UTF-8 and UCS-2"

So if the OP is getting two bytes for each character probably the DB is
using UCS-2 ?

G

Tom Serface

unread,

Nov 21, 2009, 1:39:03 AM11/21/09

to

I think you are right. It does use more memory, of course, but there is far
less conversion to be done and I think, ultimately, the data is handled
faster.

Tom

"Giovanni Dicanio" <giovanniD...@REMOVEMEgmail.com> wrote in message
news:uQ4RHgha...@TK2MSFTNGP04.phx.gbl...

Tom Serface

unread,

Nov 21, 2009, 1:38:35 AM11/21/09

to

Or the conversion is happening before it gets to the database.

Tom

"Giovanni Dicanio" <giovanniD...@REMOVEMEgmail.com> wrote in message

news:ObC7vhha...@TK2MSFTNGP04.phx.gbl...

Mihajlo Cvetanovi�

unread,

Nov 21, 2009, 4:08:47 AM11/21/09

to

I wasn't accurate enough. It is not true that all bytes I send to database
occupy two bytes there. That is true only for bytes outside the basic ASCII
range. If the string contains only numbers and english letters then the DB2
length of the string in bytes is the count of characters +1 for NUL (as
expected).

The thing that happened is that DB2 couldn't execute one operation because
one string was too long. Now, when I catch a CDBException within debugger
and in Watch window dig into bound parameters of UPDATE query I see strings
in memory as they are sent to ODBC driver. One of the characters in one
string is UTF-8 character U201A (SINGLE LOW-9 QUOTATION MARK), which MFC
sends to DB2 as 3 bytes, but this actually takes 6 bytes in database. I was
expecting it would take 3 bytes in database as well, because the database is
set to UTF-8.

I am 99% sure that there is conversion from system code page to UTF-8
between DB2 ODBC driver and DB2 database. Therefore I'm technically in the
wrong group, but was kinda hoping someone knows about this already.
microsoft.public.vc.mfc is more responsive group than most. My question is
how to avoid this conversion?

Giovanni Dicanio

unread,

Nov 21, 2009, 5:35:12 AM11/21/09

to

"Mihajlo Cvetanovi�" <mcvetanovic@gmail> ha scritto nel messaggio
news:OmyPCpoa...@TK2MSFTNGP04.phx.gbl...

> I wasn't accurate enough. It is not true that all bytes I send to database

I assume: "that all bytes" --> "that all characters"

> occupy two bytes there. That is true only for bytes outside the basic
> ASCII range. If the string contains only numbers and english letters then
> the DB2 length of the string in bytes is the count of characters +1 for
> NUL (as expected).

So probably DB2 is actually using UTF-8.

> The thing that happened is that DB2 couldn't execute one operation because
> one string was too long. Now, when I catch a CDBException within debugger
> and in Watch window dig into bound parameters of UPDATE query I see
> strings in memory as they are sent to ODBC driver. One of the characters
> in one string is UTF-8 character U201A (SINGLE LOW-9 QUOTATION MARK),
> which MFC sends to DB2 as 3 bytes, but this actually takes 6 bytes in
> database. I was expecting it would take 3 bytes in database as well,
> because the database is set to UTF-8.

If you go there:

http://www.fileformat.info/info/unicode/char/201a/index.htm

you can see that 0x201A is actually Unicode UTF-16, and corresponding UTF-8
is 0xE2 0x80 0x9A.
So probably MFC is sending the UTF-8 encoding to the DB2.
Can you see the actual bytes that MFC is sending?
What is the actual value of the 6 bytes?

Giovanni

Mihajlo Cvetanovic

unread,

Nov 21, 2009, 6:58:48 AM11/21/09

to

"Giovanni Dicanio" <giovanniD...@REMOVEMEgmail.com> wrote in message

news:ueTdUZpa...@TK2MSFTNGP04.phx.gbl...

> "Mihajlo Cvetanovi�" <mcvetanovic@gmail> ha scritto nel messaggio
> news:OmyPCpoa...@TK2MSFTNGP04.phx.gbl...
>
>> I wasn't accurate enough. It is not true that all bytes I send to
>> database
>
> I assume: "that all bytes" --> "that all characters"

My code is using UTF-8 CStrings. My application is non-Unicode, but I must
have Unicode strings, so CStrings are UTF-8 strings when dealing with pretty
much everything except GUI. But when sending them through ODBC they are
technically strings of bytes, and my guess is that ODBC is interpreting them
as strings in system code page instead of as UTF-8 strings..

> So probably DB2 is actually using UTF-8.

Yes.

> you can see that 0x201A is actually Unicode UTF-16, and corresponding
> UTF-8 is 0xE2 0x80 0x9A.
> So probably MFC is sending the UTF-8 encoding to the DB2.

Yes.

> Can you see the actual bytes that MFC is sending?

Yes, they are UTF-8 strings in memory. U201A character is represented with 3
bytes you gave here (0xE2 0x80 0x9A). There is no internal conversion. Those
strings in memory go to SQL SDK functions within MFC.

> What is the actual value of the 6 bytes?

I dont know at this time (I could find out on Monday), but I'm 99% certain
that they are those 3 bytes you gave here, but converted to UTF-8 again. All
I know for now is that they occupy 7 bytes in database (6 + NUL).

Giovanni Dicanio

unread,

Nov 21, 2009, 7:17:48 AM11/21/09

to

"Mihajlo Cvetanovic" <mcvetanovic@gmail> ha scritto nel messaggio
news:eiTd9Hqa...@TK2MSFTNGP04.phx.gbl...

> My code is using UTF-8 CStrings. My application is non-Unicode, but I must
> have Unicode strings, so CStrings are UTF-8 strings when dealing with
> pretty much everything except GUI.

You are storing UTF-8 data in CString's in a non-Unicode MFC app, so CString
is actually CStringA.
OK.
(Note that you may want to use CStringA explicitly or std::string, so even
if you switch your app to Unicode, your strings will continue storing UTF-8
data.)

>> Can you see the actual bytes that MFC is sending?
>
> Yes, they are UTF-8 strings in memory. U201A character is represented with
> 3 bytes you gave here (0xE2 0x80 0x9A). There is no internal conversion.
> Those strings in memory go to SQL SDK functions within MFC.

OK.

>> What is the actual value of the 6 bytes?
>
> I dont know at this time (I could find out on Monday), but I'm 99% certain
> that they are those 3 bytes you gave here, but converted to UTF-8 again.
> All I know for now is that they occupy 7 bytes in database (6 + NUL).

So, let's wait on Monday.

Giovanni

Mihai N.

unread,

Nov 23, 2009, 2:57:31 AM11/23/09

to

> Is there a way to disable this extra conversion to UTF-8, and make DB2
> aware that strings from ODBC are already UTF-8?

First gut feeling: there are components in Windows (except for code page
conversions APIs) that take utf-8 strings as input/output.
So I strongly doubt ODBC is an exception to that.
If you don't use wchar_t (or WCHAR, same thing), meaning UTF-16, then
you probably use char* so that means ANSI code page for ODBC and does
(the extra) conversion.

You might get away with double conversion if you run on an English system,
but I anticipate troubles for anything non-1252 (think adding stuff to DB
from English OS and reading it on a Japanese one).

I would say that you should try to byte the bullet and "talk" utf-16
with ODBC.

--
Mihai Nita [Microsoft MVP, Visual C++]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

Mihai N.

unread,

Nov 23, 2009, 3:03:55 AM11/23/09

to

> My code is using UTF-8 CStrings. My application is non-Unicode, but I must
> have Unicode strings, so CStrings are UTF-8 strings when dealing with
> pretty much everything except GUI.

There is no such thing as UTF-8 CStrings.
You might store UTF-8 bytes in there, but pretty much anybody outside your
application will interpret it as ANSI code page.
So to "pretty much everything except GUI" should change to:
Pretty much everything except
- GUI
- ODBC
- Registry access
- File system
- Clipboard
- network comunication (unless you do your own low-level
thing and don't use the Win APIs/MFC classes)
- who knows what else

Basically using UTF-8 is asking for trouble.
In the long run you should really consider going utf-16.
This is the only Unicode encoding in the Windows world.

> But when sending them through ODBC they are
> technically strings of bytes, and my guess is that ODBC is interpreting
> them as strings in system code page instead of as UTF-8 strings..

Correct. This is expected.

Joseph M. Newcomer

unread,

Nov 23, 2009, 9:26:54 AM11/23/09

to

UTF-8 is used "at the boundaries" when you bring data in or write data out, if you use it
at all. It should never be used internally. Use Unicode internally.
joe

On Mon, 23 Nov 2009 00:03:55 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>
>> My code is using UTF-8 CStrings. My application is non-Unicode, but I must
>> have Unicode strings, so CStrings are UTF-8 strings when dealing with
>> pretty much everything except GUI.
>
>There is no such thing as UTF-8 CStrings.
>You might store UTF-8 bytes in there, but pretty much anybody outside your
>application will interpret it as ANSI code page.
>So to "pretty much everything except GUI" should change to:
>Pretty much everything except
> - GUI
> - ODBC
> - Registry access
> - File system
> - Clipboard
> - network comunication (unless you do your own low-level
> thing and don't use the Win APIs/MFC classes)
> - who knows what else
>
>Basically using UTF-8 is asking for trouble.
>In the long run you should really consider going utf-16.
>This is the only Unicode encoding in the Windows world.
>
>
>> But when sending them through ODBC they are
>> technically strings of bytes, and my guess is that ODBC is interpreting
>> them as strings in system code page instead of as UTF-8 strings..
>
>Correct. This is expected.

Joseph M. Newcomer [MVP]
email: newc...@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm

Mihajlo Cvetanović

unread,

Nov 23, 2009, 9:54:07 AM11/23/09

to

Giovanni Dicanio wrote:
> So, let's wait on Monday.

Problem solved. All I had to do is call one line in command prompt:

db2set DB2CODEPAGE=1208

Now it works, UTF-8 strings in non-Unicode application go to DB2 as they
should. Thank you all.

Tom Serface

unread,

Nov 23, 2009, 11:12:37 AM11/23/09

to

Glad you got it figured out. That was an interesting one even if it was
not specifically MFC related.

I use UTF-8 to store data outside of my program (files), but I always
convert to Unicode when I come back in.

You may still want to go to Unicode in your application as some point
because being dependent on the code page for the conversion is eventually
going to become more trouble for you than it's worth IMO. But for now
sounds like it works for you.

Tom

"Mihajlo Cvetanović" <mcvetanovic@gmail> wrote in message

news:um60QzEb...@TK2MSFTNGP02.phx.gbl...

Tom Serface

unread,

Nov 23, 2009, 11:14:46 AM11/23/09

to

I wish MSFT would get better at working with UTF-8. It is more trouble to
work with in memory, of course, but for storing files or database data (like
OP is doing) it makes a great deal of sense. In cases like ourse where most
of the time users only need one byte characters, but we still need to
support MBCS it means our external files are only bigger if they need to be.
The conversion from UTF-8 to Unicode is very fast in memory (from what I've
seen).

Tom

"Joseph M. Newcomer" <newc...@flounder.com> wrote in message
news:1s6lg5tgq9th8j09e...@4ax.com...

David Wilkinson

unread,

Nov 23, 2009, 12:23:35 PM11/23/09

to

Joseph M. Newcomer wrote:
> UTF-8 is used "at the boundaries" when you bring data in or write data out, if you use it
> at all. It should never be used internally. Use Unicode internally.

UTF-8 *is* Unicode. "Windows Unicode" is UTF-16, stored in wchar_t strings (with
surrogate pairs where necessary).

But I think we agree that the thing to avoid, at all costs, is having anything
to do with the local code page. It is a pity, IMHO, that CString (and other
parts of MFC) has automatic conversions that assume the local code page for
8-bit strings. These features can be very dangerous if you want an application
that uses UTF-8 "at the boundaries", especially if you only test your code with
ASCII strings (sloppy, I agree, but it happens...).

--
David Wilkinson
Visual C++ MVP

Joseph M. Newcomer

unread,

Nov 23, 2009, 12:44:53 PM11/23/09

to

Well, let me rephrase that: use UTF-16. UTF-8 is an encoding of Unicode, and now that
Unicode is 32-bit, UTF-16 is also an encoding, but it is the native encoding of Windows,
whereas UTF-8 is not.

I always do my own encode/decode when I use UTF-8.
joe

Giovanni Dicanio

unread,

Nov 23, 2009, 12:43:36 PM11/23/09

to

"Mihajlo Cvetanović" <mcvetanovic@gmail> ha scritto nel messaggio
news:um60QzEb...@TK2MSFTNGP02.phx.gbl...

> Problem solved.

Great.

> All I had to do is call one line in command prompt:
>
> db2set DB2CODEPAGE=1208

OK, so you set the UTF8 code page identifier in DB2.

> Now it works, UTF-8 strings in non-Unicode application go to DB2 as they
> should. Thank you all.

Glad you solved that.

BTW: I second Tom's (and others) suggestion to move your app to Unicode.

Giovanni

Tom Serface

unread,

Nov 23, 2009, 2:34:58 PM11/23/09

to

We windows programmers often fall in to the trap of considering Unicode ==
UTF-16 and often UTF-16 LE even though the Windows Unicode is not exactly
any of them that I can see. You're right abotu CString doing auto
conversions. It can get someone into trouble that they didn't even know
they had.

Tom

"David Wilkinson" <no-r...@effisols.com> wrote in message
news:OFoxwGGb...@TK2MSFTNGP05.phx.gbl...

Tim Slattery

unread,

Nov 23, 2009, 4:07:49 PM11/23/09

to

David Wilkinson <no-r...@effisols.com> wrote:

>Joseph M. Newcomer wrote:
>> UTF-8 is used "at the boundaries" when you bring data in or write data out, if you use it
>> at all. It should never be used internally. Use Unicode internally.
>
>UTF-8 *is* Unicode.

I don't think so, not exactly. A file encoded in UTF-8 uses a single
byte to store characters in the 7-bit ASCII code. For other characters
it uses a multi-byte sequence to store the appropriate Unicode value:
sometimes 2 bytes, sometimes 3, sometimes 4 bytes for a single Unicode
character.

See the Wikipedia article: http://en.wikipedia.org/wiki/UTF8

--
Tim Slattery
Slatt...@bls.gov
http://members.cox.net/slatteryt

Joseph M. Newcomer

unread,

Nov 23, 2009, 6:14:53 PM11/23/09

to

See below...

On Mon, 23 Nov 2009 16:07:49 -0500, Tim Slattery <Slatt...@bls.gov> wrote:

>David Wilkinson <no-r...@effisols.com> wrote:
>
>>Joseph M. Newcomer wrote:
>>> UTF-8 is used "at the boundaries" when you bring data in or write data out, if you use it
>>> at all. It should never be used internally. Use Unicode internally.
>>
>>UTF-8 *is* Unicode.
>
>I don't think so, not exactly. A file encoded in UTF-8 uses a single
>byte to store characters in the 7-bit ASCII code. For other characters
>it uses a multi-byte sequence to store the appropriate Unicode value:
>sometimes 2 bytes, sometimes 3, sometimes 4 bytes for a single Unicode
>character.

****
I think the issue is whether or not UTF-8 is functionally equivalent to Unicode.
Essentially, from an *information* content perspective, UTF-8 == UTF-16[LB]E == UTF32[LB]E

But UTF-8 is independent of the endianness of the sender and receiver, and as such means
that you can transparently read a UTF-8 stream into ANY endian machine, converting it to
UTF-16 or UTF-32 internally using the native endianness of the machine, manipulate it
using the standard Unicode capabilities (OK, we'll still have to deal with surrogates with
UTF-16, and that remains a problem, but the problem is independent of the endianness).
Then you can write the data back out in UTF-8 (native endian to UTF-8), and have an
endian-independent representation again.

Now Windows uses UTF-16LE as its native encoding. If I give a UTF-8 string to CreateFile,
I'll be calling CreateFileA, and get a weird-looking filename if any character in that
string has a numeric value > 127. If I'm interfacing to a database system, I need to know
what *its* native representation of a string is. If it stores 8-bit strings, I have to
store a UTF-8 encoding if I want to retain the information content of the original string.
Of course, this means that everyone who is using that database has to understand that the
strings are being stored in UTF-8. The values can be handed around as uninterpreted
character sequences; they can be compared for equality (but not collating sequence),
written back into other string fields of the database, and transmitted out to other
places, all of which have to understand they are dealing with UTF-8. Because it is a
sequence of endian-independent characters with no embedded 0 bytes (which, if a native
UTF-LE or UTF-BE string is sent to a context that expects an 8-bit string will cause it to
malfunction), it can be handed around as a "black box" value that has very little that can
be done with it; for example, it cannot be displayed, it cannot be compared for collating
order, it cannot be used in any context in which the characters have significance (such as
a file name). Because this can become confusing, and at various points you end up having
to convert to or from UTF-16[BL]E so the appropriate API has a well-formed character
string, it is usually simpler to keep the representation in the program entirely in
UTF-16[BL]E (that is, in a Windows app, UTF-16LE), and convert it only at the boundaries
where it gets transmitted, or in the case of a database limited to 8-bit characters,
stored. Life is simpler if only the external forms are UTF-8.

Which leads to the interesting question about database sizes. Suppose I want to store an
80-character string. In ANSI, I need 80 bytes to store that string. But in UTF-16
encodings, I would need 160 bytes. Sounds bad. But realize that to hold 80 *characters*
if I have UTF-8, I could potentially need ***320*** bytes, since each of the UTF-8
encodings could be 4 8-bit bytes. If I must guarantee 80 characters, I have to allow for
80 UTF-8-encoded characters. So, you argue, the number of 4-byte encodings is small, and
applies only to "foreign" languages; because of the richness of the ideographic languages,
while I might need 4 bytes to encode the information, the languages are so rich that I
might need only 20 glyphs in these languages, so my 80-byte field can hold 20 Chinese, or
Sumerian, or Cuneiform glyphs. Well, now try to explain this to someone using one of the
UTF-32 glyphs whose value is > 65536. Now in UTF-16, I would need surrogates, so to
represent 80 glyphs I would need 80 surrogate+value pairs, which is 160 UTF-16 values, or
320 bytes. So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320 bytes
in length. So the file sizes are the same! Note that it is not just dead languages like
Cuneiform that are in that region, so are you going to try to explain to someone whose
native language is expressed in the range > 65535 that they can't use your database to
represent as many characters as someone who is using, say, American English. Note that
you have to now think about new kinds of abstraction; if you want 80 characters, the limit
has to be computed purely on the actual characters, independent of the encoding. So while
the database field might be 320 bytes in lenght, I have to artificially impose on it a
limit of 80 *characters*. If you give me a "UTF-8 encoded" string that only has the 7-bit
subset (space thru ~, to simplify the explanation), then you would have to limit the field
to 80 bytes (losing 160 bytes in each record); if you get something that has character
sequences in the range 128..65535, encoded in UTF-16, you would limit it to 80 characters,
by counting the number of characters, or 160 bytes, losing, well, 160 bytes in each
record. In UTF-8, you would have to count the number of UTF-16 characters, and truncate
at 80.

For codes that require surrogates, you would essentially have to encode in UTF32, and
count characters. Your lossage woud vary from between 316 and 0 bytes (1-character UTF-32
to 80 UTF-32 characters in length).

If using UTF-8 or UTF-16, you would have to convert to UTF-32, limit the length to 80
UTF-32 characters, then convert back to UTF-16 or UTF-8 to the string you want, which
could be somewhere between 1 and 320 bytes in length.

So it is not at all clear that if a database supports only 8-bit characters that you would
actually have "smaller" files if you really want to maintain identical functionality
across all possible customer character sets.

Of course, the ultimate answer do making smaller files is to store your database with
variable-length strings. This has the advantage that it saves space, but now you have the
time performance to consider. You can't locate record N by computing a file offset of
base_offset + N * (record_length), updates can be more expensive (for those of us who grew
up with IBM's ISAM variable-length databases, the whole issue of overflow pages is a
natural paradigm, but it is slow, expensive in time and disk space, and in those days it
was essential when BIG disk drives held tens of megabytes) Today, it is not clear that
disk space has any meaning whatsoever until you get into the gigarecord-sized databases; I
just added 4TB of RAID storage for $320 plus the cost of the iSCSI rack to hold them.
joe
****

>
>See the Wikipedia article: http://en.wikipedia.org/wiki/UTF8

David Wilkinson

unread,

Nov 23, 2009, 9:02:21 PM11/23/09

to

Tim Slattery wrote:
> I don't think so, not exactly. A file encoded in UTF-8 uses a single
> byte to store characters in the 7-bit ASCII code. For other characters
> it uses a multi-byte sequence to store the appropriate Unicode value:
> sometimes 2 bytes, sometimes 3, sometimes 4 bytes for a single Unicode
> character.

For some code points UTF-16 uses two code units. There is really no conceptual
difference between UTF-8 and UTF-16. Both are full encodings of the Unicode
standard.

Mihai N.

unread,

Nov 24, 2009, 3:28:13 AM11/24/09

to

> I always do my own encode/decode when I use UTF-8.

In general I recomend against that.
It is quite easy to get wrong and not validate it properly.
I have seen even RFCs giving wrong algorithms for that.
Not saying that yours is bad, I have not seen it.

But getting it right is something the average Joe
(in contrast to the Newcomer Joe :-) will probably
fail to do.

For people that did not want to use the Windows stuff
I used to recomend the conversion routines at
ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF
But they are now gone!
(on 10/31/2009 they were still there)

I have to ask why that happened.

Mihai N.

unread,

Nov 24, 2009, 3:49:58 AM11/24/09

to

> Essentially, from an *information* content perspective,
> UTF-8 == UTF-16[LB]E == UTF32[LB]E

Unicode considers the various UTFs flavors completely equivalent.
Just various encoding forms for the same thing.
Unicode itself is not 32 bit, or 8, or anything.
It is just a mapping from characters to numbers plus a collection of
character properties.

> If I'm interfacing to a database system, I need to know
> what *its* native representation of a string is.
> If it stores 8-bit strings, I have to
> store a UTF-8 encoding if I want to retain the information content
> of the original string.
> Of course, this means that everyone who is using that database has
> to understand that the
> strings are being stored in UTF-8.

I would argue that I should not have to care about the internal encoding
of the database.
The correct types used should be NCHAR, NVARCHAR and NTEXT.
The public API should take UTF-16 or UTF-32 or UTF-8 and document it.
Any conversion between the public API text representation and the internal
format should be transparent.

Also the database should be aware that text stored is Unicode, and not
just a bunch of bytes.
Becase otherwise things like sorting (and functions like between),
case-insensitive searching, functions like substring, replace, like,
% (one or more characters), _ (one character), will not do the right thing.

Stuff can be move around without awarenes of what is in there, but one has to
be very careful what operations are save and wich ones are not
(pretty much liks storing utf-8 in CString).

> So whether I use UTF-8 or UTF-16 with surrogates, my field has to be 320
> bytes in length.

100% agree.

> Note that it is not just dead languages like
> Cuneiform that are in that region, so are you going to try to explain to
> someone whose native language is expressed in the range > 65535 that
> they can't use your database to represent as many characters as
> someone who is using, say, American English

To make it more real: characters beyond BMP (Basic Multilingual Plane)
are required in order to support the GB-18030 Chinese National standard.
And the standard is enforced. If you want to sell your software in China,
you have to get a GB-18030 certification, or you don't sell it.

Also national standard in Japan and Hong Kong require support for
characters above U+FFFF. Although the standard are not enforced like the
Chinese one, supporting them might give you an extra edge in a competitive
market.

So beyond BMP it is not about some extinct languages that only few
archeologists care about.

Joseph M. Newcomer

unread,

Nov 24, 2009, 10:11:14 AM11/24/09

to

When I said "My own encode/decode" I was referring to using

MultiByteToWideChar(CP_UTF8,...)
or
WideCharToMultiByte(CP_UTF8,...)

and not that I was actualy writing my own algorithms to do it.
joe

On Tue, 24 Nov 2009 00:28:13 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>> I always do my own encode/decode when I use UTF-8.
>
>In general I recomend against that.
>It is quite easy to get wrong and not validate it properly.
>I have seen even RFCs giving wrong algorithms for that.
>Not saying that yours is bad, I have not seen it.
>
>But getting it right is something the average Joe
>(in contrast to the Newcomer Joe :-) will probably
>fail to do.
>
>For people that did not want to use the Windows stuff
>I used to recomend the conversion routines at
> ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF
>But they are now gone!
>(on 10/31/2009 they were still there)
>
>I have to ask why that happened.

Joseph M. Newcomer

unread,

Nov 24, 2009, 10:15:31 AM11/24/09

to

See below...

On Tue, 24 Nov 2009 00:49:58 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>> Essentially, from an *information* content perspective,
>> UTF-8 == UTF-16[LB]E == UTF32[LB]E
>
>Unicode considers the various UTFs flavors completely equivalent.
>Just various encoding forms for the same thing.
>Unicode itself is not 32 bit, or 8, or anything.
>It is just a mapping from characters to numbers plus a collection of
>character properties.
>
>
>> If I'm interfacing to a database system, I need to know
>> what *its* native representation of a string is.
>> If it stores 8-bit strings, I have to
>> store a UTF-8 encoding if I want to retain the information content
>> of the original string.
>> Of course, this means that everyone who is using that database has
>> to understand that the
>> strings are being stored in UTF-8.
>
>I would argue that I should not have to care about the internal encoding
>of the database.
>The correct types used should be NCHAR, NVARCHAR and NTEXT.
>The public API should take UTF-16 or UTF-32 or UTF-8 and document it.
>Any conversion between the public API text representation and the internal
>format should be transparent.

****
In the ideal world, that would be a great idea. Far too many databases are stuck in the
8-bit world, and a few understand UTF-16. So ultimately the user has to be aware of the
nature of the database. I was talking about the idea that the database is not smart
enough to be aware of character encodings.

I should send the data with something that says what the encoding is, and it should worry
about the details. AFAIK, that isn't something that is supported.
****

****
I didn't recognize all the lists above the BMP (I keep the Unicode 5.0 book always within
arm's reach, right next to the computer), but I chose the Phoenician and Cuneiform as
examples, but covered the others by inclusion.
joe
****

Mihai N.

unread,

Nov 25, 2009, 1:31:25 AM11/25/09

to

> and not that I was actualy writing my own algorithms to do it.

I agree that using the win api is the safe way.
Although I am sure you can do your own stuff, if you want :-)

Mihai N.

unread,

Nov 25, 2009, 1:34:20 AM11/25/09

to

> but I chose the Phoenician and Cuneiform as
> examples, but covered the others by inclusion.

There are quite a few things beyond BMP.
But I prefer to give the examples of Chinese and Japanese,
because people tend to say "I don't need, and will
never need to support Phoenician or Cuneiform"

Joseph M. Newcomer

unread,

Nov 25, 2009, 12:20:30 PM11/25/09

to

I will use that certification requirement for future discussions.
joe

On Tue, 24 Nov 2009 22:34:20 -0800, "Mihai N." <nmihai_y...@yahoo.com> wrote:

>> but I chose the Phoenician and Cuneiform as
>> examples, but covered the others by inclusion.
>
>There are quite a few things beyond BMP.
>But I prefer to give the examples of Chinese and Japanese,
>because people tend to say "I don't need, and will
>never need to support Phoenician or Cuneiform"