UTF-8 databases


Troels Arvin

Sep 10, 2010, 4:17:29 PM
Hello,

I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5
server.

For the first time, I set it up using UTF-8, thinking that the time
has come for this with DB2.

But now I'm seeing strange situations like:

CREATE TABLE foo (x VARCHAR(5));
INSERT INTO foo VALUES('XXåXX');
SQL0433N Value "XXåXX" is too long. SQLSTATE=22001
INSERT INTO foo VALUES('XXåX')

SELECT * FROM foo

X
-----
XXåX

db2 "SELECT LENGTH(x) FROM foo"

1
-----------
5

This is absurd.

I have tried various combinations of CREATE DATABASE specifications
without luck.

Does DB2 still not support UTF-8 properly?
Do I need to stay with latin1 databases for things to work as everyone
would expect, or are there somewhat sane work-arounds which can be
employed?

--
Troels

Luuk

Sep 10, 2010, 4:48:35 PM
On 10-09-10 22:17, Troels Arvin wrote:

i think your question was answered here:
http://bytes.com/topic/db2/answers/644061-column-size-multibyte-data

--
Luuk

Luuk

Sep 10, 2010, 4:59:43 PM
On 10-09-10 22:48, Luuk wrote:

C:\Program Files>db2 SELECT LENGTH('XXåXX') from foo

1
-----------
6

1 record(s) selected.


C:\Program Files>db2 SELECT LENGTH('XX€XX') from foo

1
-----------
7

1 record(s) selected.


C:\Program Files>

--
Luuk

Tonkuma

Sep 10, 2010, 8:38:26 PM
------------------------------ Commands Entered
------------------------------
SELECT LENGTH(x) AS "LENGTH(x)",
       CHARACTER_LENGTH(x, CODEUNITS16) AS "CHARACTER_LENGTH(x, CODEUNITS16)"
FROM (VALUES 'XX€XX') foo(x);
------------------------------------------------------------------------------

LENGTH(x) CHARACTER_LENGTH(x , CODEUNITS16)
----------- ---------------------------------
7 5

1 record(s) selected.


Helmut Tessarek

Sep 10, 2010, 8:47:22 PM
Hi Troels,

> Does DB2 still not support UTF-8 properly?
> Do I need to stay with latin1 databases for things to work as everyone
> would expect, or are there somewhat sane work-arounds which can be
> employed?

In a Unicode database the length of a (VAR)CHAR does not represent the number
of characters, but the number of bytes.

In UTF-8 a character can take up to 4 bytes. How should the database know which
characters you will insert? If the database automatically allocated 4 bytes per
character, you would waste a lot of space when inserting only standard
characters.

The problem you are seeing is that umlauts and other accented characters
take 2 bytes in UTF-8, whereas in the latin1 code page those
characters are only 1 byte long.
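
The byte counts involved can be checked outside DB2. As a minimal sketch (plain Python 3, nothing DB2-specific), UTF-8 encoding reproduces the LENGTH() values seen earlier in the thread, and the plain character count matches Tonkuma's CHARACTER_LENGTH(..., CODEUNITS16) result:

```python
def utf8_len(s: str) -> int:
    # Number of bytes the string occupies in UTF-8; this is what
    # LENGTH() counts in a DB2 Unicode database.
    return len(s.encode("utf-8"))

print(utf8_len("XXåXX"))  # 6: four 1-byte ASCII chars + 2 bytes for 'å'
print(utf8_len("XX€XX"))  # 7: '€' (U+20AC) takes 3 bytes in UTF-8
print(len("XX€XX"))       # 5 characters
```

This is why 'XXåXX' does not fit into a VARCHAR(5): it needs 6 bytes.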

--
Helmut K. C. Tessarek
DB2 Performance and Development

/*
Thou shalt not follow the NULL pointer for chaos and madness
await thee at its end.
*/

Henrik Loeser

Sep 13, 2010, 7:33:33 AM

What should work is to make the VARCHAR columns larger, but to use
check constraints:
db2 => create table t(id int, s varchar(20) check(char_length(s,codeunits32)<11))
DB20000I The SQL command completed successfully.
db2 => insert into t values(1,'öüäöüä')
DB20000I The SQL command completed successfully.
db2 => insert into t values(2,'testtest')
DB20000I The SQL command completed successfully.
db2 => insert into t values(3,'testtesttest')
DB21034E  The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0545N  The requested operation is not allowed because a row does not
satisfy the check constraint "HLOESER.T.SQL100913132928480".
SQLSTATE=23513
db2 => select length(s),id from t

1 ID
----------- -----------
12 1
8 2

2 record(s) selected.

You have to use CODEUNITS32 for UTF-8 data.
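
As a quick cross-check outside DB2: in Python 3, len() counts Unicode code points, which corresponds to what the CODEUNITS32 check constraint counts, while the UTF-8 byte count corresponds to what LENGTH() reported above for row 1:

```python
s = "öüäöüä"                   # the string inserted as id 1 above
print(len(s))                  # 6 code points: passes char_length(s, codeunits32) < 11
print(len(s.encode("utf-8")))  # 12 bytes: the value LENGTH(s) reported
```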

Henrik

Troels Arvin

Sep 13, 2010, 2:26:48 PM
Henrik Loeser wrote:
> What should work is to make the VARCHAR columns larger, but to use check
> constraints:
[...]

Thanks. Yes, but this would entail a large amount of code inspection and
query/DDL rewrites - which is not an option.

--
Troels

Henrik Loeser

Sep 14, 2010, 1:46:18 AM

With UTF-8/Unicode databases it's a trade-off between the flexibility to
store any character and the usually more space-efficient, but restricted,
local code pages. It is your call.

Henrik

Lennart

Oct 15, 2010, 2:41:37 AM
On Sep 11, 2:47 am, Helmut Tessarek <tessa...@evermeet.cx> wrote:
> Hi Troels,
>
> > Does DB2 still not support UTF-8 properly?
> > Do I need to stay with latin1 databases for things to work as everyone
> > would expect, or are there somewhat sane work-arounds which can be
> > employed?
>
> In a unicode database the length of a (VAR)CHAR does not represent the number
> of characters, but the number of bytes.
>

Is it generally defined this way, or is it a DB2-specific definition? Anyhow,
I'm wondering whether it has been discussed to introduce a string type
that "hides" the additional space requirements. I.e., if I declare a
column as new_varchar(3), it would reserve up to 12 bytes.

I can understand the rationale behind today's construction, but in many
situations one has a database where the size of the data is small,
but there are a lot of standard domains (say, street address for
example). These domains will require tweaking (multiplying sizes and adding
check constraints as Henrik proposed).
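
To make the proposed semantics concrete, here is a hypothetical sketch in Python; the name fits_new_varchar and the behavior are invented for illustration, not an existing DB2 feature. The idea is that the user-visible limit is n characters, while the storage layer reserves 4*n bytes for the UTF-8 worst case:

```python
def fits_new_varchar(value: str, n: int) -> bool:
    # A value of at most n characters never needs more than 4*n bytes
    # in UTF-8, so reserving 4*n bytes makes a character-count check
    # sufficient; no byte-level check is exposed to the user.
    return len(value) <= n

value = "åäö"                      # 3 characters, 6 bytes in UTF-8
print(fits_new_varchar(value, 3))  # True: fits a hypothetical new_varchar(3)
print(len(value.encode("utf-8")))  # 6, well within the 12 reserved bytes
```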


/Lennart

Frederik Engelen

Oct 15, 2010, 3:55:44 AM

I agree with the above statement. Having to explain this to developers over
and over again gives DB2 a bad reputation. One can try the
VARGRAPHIC data type, but that will hugely increase the size of your
database.

--
Frederik Engelen

Helmut Tessarek

Oct 15, 2010, 5:37:09 AM
> I agree with the above statement. It gives DB2 a bad reputation having
> to explain this to developers over and over again. One can try the
> vargraphic data type, but that will hugely increase the size of your
> database.

Ok, now I'm a little bit confused. Using a new datatype which masks the byte
size (in favor of supporting all possible characters) or using VARGRAPHIC - what
is the difference?
In any case, we have to abide by the rules of computer science. No database
will fit a 5-byte character string in a 4-byte field. Don't get me wrong,
there might be databases which handle this differently, but the rules are the
same and always will be.
Just because other databases don't tell you that space is lost by using a
generalized datatype does not change the fact that space (and
performance) is lost.
I'm not sure how much I should go into detail in regards to programming, but I
guess most of you do have extensive knowledge in programming and the internals
of compiler and processor fundamentals.

So please, if you have a valid idea how to fit a 5-byte character string into
a 4-byte field (which is just an example - it could also be a 20-byte string
into a 5-byte field), humor me. Don't mention compression, because we
already have that. We are not talking about compression, but about
the low-level representation of data.

I always dreamt about unicode capable CPUs, but nobody is listening to me...

Frederik Engelen

Oct 15, 2010, 7:42:05 AM

Hello Helmut,

I really understand your argument, I know where the limit comes from.
Do you want me to make a feature request to the Power team? ;-)

But from a user's point of view, our problem is not solved. Using
VARGRAPHIC can be problematic because it will double the size of your
database (assuming you only store strings), even if you restrict
yourself to ASCII characters 99% of the time. Perhaps this
doubling can be reduced to a reasonable amount by using
compression; I never tested this.

I would also be very happy with basic functionality like this:

- the new varchar type reserves 4 times the space specified
- the length is checked on the specified size (duh...)
- all scalar functions (LENGTH, etc...) automatically apply the
CODEUNITS32 keyword on this data type

I think that should be about it. Someone will probably come up with
some other requirements, but I've made my point.

The downside would be that we would have to move to a bigger page size
more quickly. This follows from the fact that a row cannot span
more than one page in DB2, in contrast to some other database systems. Usually
this is not a big deal, but it might be the reason those systems could already
implement a "VARCHAR(20 CHAR)" approach. I personally wouldn't mind
this limitation. With the 255 rows/page limit expanded, the chance
that this would waste space is also greatly reduced.

What do you think?

Kind regards,

Frederik Engelen


Lennart

Oct 15, 2010, 9:35:57 AM
On Oct 15, 11:37 am, Helmut Tessarek <tessa...@evermeet.cx> wrote:
> > I agree with the above statement. It gives DB2 a bad reputation having
> > to explain this to developers over and over again. One can try the
> > vargraphic data type, but that will hugely increase the size of your
> > database.
>
> Ok, now I'm a little bit confused. Using a new datatype which masks the byte
> size (in favor of all possible characters) or using VARGRAPHIC - what is the
> difference?

Admittedly, I did not even consider using VARGRAPHIC. I assume
there are some things one should weigh before choosing it. Does
anyone have a link to the docs where VARCHAR and VARGRAPHIC are
compared - what restrictions apply in each case, possible gotchas,
etc.?

/Lennart

[...]

Helmut Tessarek

Oct 16, 2010, 1:00:11 AM
Hi Frederik,

On 15.10.10 1:42, Frederik Engelen wrote:
> I would also be very happy with basic functionality like this:
>
> - the new varchar type reserves 4 times the space specified
> - the length is checked on the specified size (duh...)
> - all scalar functions (LENGTH, etc...) automatically apply the
> CODEUNITS32 keyword on this data type

I'm on vacation right now, but when I'm back at work, I'll start a discussion
with the different component owners (although I can't promise that there'll be
a positive outcome :-))

What do you mean by 'the length is checked on the specified size'? Can you
give me some examples?

Cheers,
Helmut

Frederik Engelen

Oct 16, 2010, 5:43:43 AM

Helmut,

First of all, enjoy your holiday.

About that sentence, it would probably have been clearer if I had just
left it out. I'm pretty sure you already got that part.

All I meant to say is that the content of the column shouldn't be
allowed to be longer than whatever you specified at creation time,
even though there is extra space reserved.

Thanks for taking this seriously. I really appreciate that such a
channel to people from IBM is possible here. Perhaps we can
organise a popularity poll? Other people might think this is a bad
idea.

--
Frederik Engelen
