I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5 server.
For the first time, I set set it up using UTF-8, thinking that the time has come for this with DB2.
But now I'm seeing strange situations like:
CREATE TABLE foo (x VARCHAR(5)); INSERT INTO foo VALUES('XXĺXX'); SQL0433N Value "XXĺXX" is too long. SQLSTATE=22001 INSERT INTO foo VALUES('XXĺX')
SELECT * FROM foo
X ----- XXĺX
db2 "SELECT LENGTH(x) FROM foo"
1 ----------- 5
This is absurd.
I have tried various combinations of CREATE DATABASE specifications without luck.
Does DB2 still not support UTF-8 properly? Do I need to stay with latin1 databases for things to work as everyone would expect, or are there somewhat sane work-arounds which can be employed?
> I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5 > server.
> For the first time, I set set it up using UTF-8, thinking that the time > has come for this with DB2.
> But now I'm seeing strange situations like:
> CREATE TABLE foo (x VARCHAR(5)); > INSERT INTO foo VALUES('XXĺXX'); > SQL0433N Value "XXĺXX" is too long. SQLSTATE=22001 > INSERT INTO foo VALUES('XXĺX')
> SELECT * FROM foo
> X > ----- > XXĺX
> db2 "SELECT LENGTH(x) FROM foo"
> 1 > ----------- > 5
> This is absurd.
> I have tried various combinations of CREATE DATABASE specifications > without luck.
> Does DB2 still not support UTF-8 properly? > Do I need to stay with latin1 databases for things to work as everyone > would expect, or are there somewhat sane work-arounds which can be > employed?
> Op 10-09-10 22:17, Troels Arvin schreef: >> Hello,
>> I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5 >> server.
>> For the first time, I set set it up using UTF-8, thinking that the time >> has come for this with DB2.
>> But now I'm seeing strange situations like:
>> CREATE TABLE foo (x VARCHAR(5)); >> INSERT INTO foo VALUES('XXĺXX'); >> SQL0433N Value "XXĺXX" is too long. SQLSTATE=22001 >> INSERT INTO foo VALUES('XXĺX')
>> SELECT * FROM foo
>> X >> ----- >> XXĺX
>> db2 "SELECT LENGTH(x) FROM foo"
>> 1 >> ----------- >> 5
>> This is absurd.
>> I have tried various combinations of CREATE DATABASE specifications >> without luck.
>> Does DB2 still not support UTF-8 properly? >> Do I need to stay with latin1 databases for things to work as everyone >> would expect, or are there somewhat sane work-arounds which can be >> employed?
> Does DB2 still not support UTF-8 properly? > Do I need to stay with latin1 databases for things to work as everyone > would expect, or are there somewhat sane work-arounds which can be > employed?
In a unicode database the length of a (VAR)CHAR does not represent the number of characters, but the number of bytes.
In UTF8 a character can have up to 4 bytes. How should the database know which characters you will insert. If the database calculated 4 times the characters automatically, you would lose a lot of space when just inserting standard characters.
The problem that you are seeing is that umlauts or other accented characters are 2 bytes in unicode compared to the latin1 character sets where those characters are only 1 byte long.
-- Helmut K. C. Tessarek DB2 Performance and Development
/* Thou shalt not follow the NULL pointer for chaos and madness await thee at its end. */
> I'm setting up a new DB2 database on a Red Hat Enterprise Linux 5.5 > server.
> For the first time, I set set it up using UTF-8, thinking that the time > has come for this with DB2.
> But now I'm seeing strange situations like:
> CREATE TABLE foo (x VARCHAR(5)); > INSERT INTO foo VALUES('XXĺXX'); > SQL0433N Value "XXĺXX" is too long. SQLSTATE=22001 > INSERT INTO foo VALUES('XXĺX')
> SELECT * FROM foo
> X > ----- > XXĺX
> db2 "SELECT LENGTH(x) FROM foo"
> 1 > ----------- > 5
> This is absurd.
> I have tried various combinations of CREATE DATABASE specifications > without luck.
> Does DB2 still not support UTF-8 properly? > Do I need to stay with latin1 databases for things to work as everyone > would expect, or are there somewhat sane work-arounds which can be > employed?
> -- > Troels
What should work is to make the VARCHAR columns larger, but to use check constraints: db2 => create table t(id int, s varchar(20) check(char_length(s,codeunits32)<11)) DB20000I The SQL command completed successfully. db2 => insert into t values(1,'öüäöüä') DB20000I The SQL command completed successfully. db2 => insert into t values(2,'testtest') DB20000I The SQL command completed successfully. db2 => insert into t values(3,'testtesttest') DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command. During SQL processing it returned: SQL0545N The requested operation is not allowed because a row does not satisfy the check constraint "HLOESER.T.SQL100913132928480". SQLSTATE=23513 db2 => select length(s),id from t
On Sep 13, 8:26 pm, Troels Arvin <tro...@arvin.dk> wrote:
> Henrik Loeser wrote: > > What should work is to make the VARCHAR columns larger, but to use check > > constraints:
> [...]
> Thanks. Yes, but this would entail a large amount of code inspection and > query/DDL rewrites - which is not an option.
> -- > Troels
With UTF-8/Unicode databases it's a trade-off between flexibility to have the option to store different characters vs. the usually more efficient local, but restricted codepages. It is your call.
On Sep 11, 2:47 am, Helmut Tessarek <tessa...@evermeet.cx> wrote:
> Hi Troels,
> > Does DB2 still not supportUTF-8properly? > > Do I need to stay with latin1 databases for things to work as everyone > > would expect, or are there somewhat sane work-arounds which can be > > employed?
> In a unicode database the length of a (VAR)CHAR does not represent the number > of characters, but the number of bytes.
Is it generally defined this way or is it a db2 definition? Anyhow, I'm wondering whether it has been discussed to introduce a string type that "hides" the additional space requirements? I.e. if I declare a column as new_varchar(3) it would reserve up to 12 bytes.
I can understand the rational by to days construction, but in many situations one have a database where the size of the data is small, but there are a lot of standard domains (say street address for example). These domains will require tweaking (multiply size and add check constraints as Henrik proposed).
> On Sep 11, 2:47 am, Helmut Tessarek <tessa...@evermeet.cx> wrote:
> > Hi Troels,
> > > Does DB2 still not supportUTF-8properly? > > > Do I need to stay with latin1 databases for things to work as everyone > > > would expect, or are there somewhat sane work-arounds which can be > > > employed?
> > In a unicode database the length of a (VAR)CHAR does not represent the number > > of characters, but the number of bytes.
> Is it generally defined this way or is it a db2 definition? Anyhow, > I'm wondering whether it has been discussed to introduce a string type > that "hides" the additional space requirements? I.e. if I declare a > column as new_varchar(3) it would reserve up to 12 bytes.
> I can understand the rational by to days construction, but in many > situations one have a database where the size of the data is small, > but there are a lot of standard domains (say street address for > example). These domains will require tweaking (multiply size and add > check constraints as Henrik proposed).
> /Lennart
I agree with the above statement. It gives DB2 a bad reputation having to explain this to developers over and over again. One can try the vargraphic data type, but that will hugely increase the size of your database.
> I agree with the above statement. It gives DB2 a bad reputation having > to explain this to developers over and over again. One can try the > vargraphic data type, but that will hugely increase the size of your > database.
Ok, now I'm a little bit confused. Using a new datatype which masks the byte size (in favor of all possible characters) or using VARGRAPHIC - what is the difference? In any case, we have to abide by the rules of computer science. No database will fit a 5 byte character string in a 4 byte field. Don't get me wrong, there might be databases which handle this differently, but the rules are the same and always will be. Just because other databases won't tell you that space is lost by doing a generalized datatype is not going to change the fact that space (and performance) is lost. I'm not sure how much I should go into detail in regards to programming, but I guess most of you do have extensive knowledge in programming and the internals of compiler and processor fundamentals.
So please, if you have a valid idea how to fit a 5 byte character string into a 4 byte field (which is just an example - it could also be a 20 byte string into a 5 byte field), humor me. Don't mention compression, because this we aready have. But we are not talking about compression, we are talking about the low level representation of data.
I always dreamt about unicode capable CPUs, but nobody is listening to me...
-- Helmut K. C. Tessarek DB2 Performance and Development
/* Thou shalt not follow the NULL pointer for chaos and madness await thee at its end. */
> > I agree with the above statement. It gives DB2 a bad reputation having > > to explain this to developers over and over again. One can try the > > vargraphic data type, but that will hugely increase the size of your > > database.
> Ok, now I'm a little bit confused. Using a new datatype which masks the byte > size (in favor of all possible characters) or using VARGRAPHIC - what is the > difference? > In any case, we have to abide by the rules of computer science. No database > will fit a 5 byte character string in a 4 byte field. Don't get me wrong, > there might be databases which handle this differently, but the rules are the > same and always will be. > Just because other databases won't tell you that space is lost by doing a > generalized datatype is not going to change the fact that space (and > performance) is lost. > I'm not sure how much I should go into detail in regards to programming, but I > guess most of you do have extensive knowledge in programming and the internals > of compiler and processor fundamentals.
> So please, if you have a valid idea how to fit a 5 byte character string into > a 4 byte field (which is just an example - it could also be a 20 byte string > into a 5 byte field), humor me. Don't mention compression, because this we > aready have. But we are not talking about compression, we are talking about > the low level representation of data.
> I always dreamt about unicode capable CPUs, but nobody is listening to me...
> -- > Helmut K. C. Tessarek > DB2 Performance and Development
> /* > Thou shalt not follow the NULL pointer for chaos and madness > await thee at its end. > */
Hello Helmut,
I really understand your argument, I know where the limit comes from. Do you want me to make a feature request to the Power team? ;-)
But from a users point of view, our problem is not solved. Using a vargraphic can be problematic because it will double the size of your database (assuming you only store strings), even if you restrict yourself to the ASCII characters 99% of the time. Perhaps this doubling can be reduced back to a reasonable amount by using compression, I never tested this.
I would also be very happy with basic functionality like this:
- the new varchar type reserves 4 times the space specified - the length is checked on the specified size (duh...) - all scalar functions (LENGTH, etc...) automatically apply the CODEUNITS32 keyword on this data type
I think that should be about it. Someone will probably come up with some other requirements, but i made my point.
The downside would be that we would have to move to a bigger pagesize more quickly. This would come from the fact that a row cannot span more than one page, in contrast to other database systems. Usually this is not a big deal, but it might be the reason they could already implement a "VARCHAR(20 CHAR)" approach. I personally wouldn't mind this limitation. With the 255 rows/page limit expanded, the chance that this would wast space is also greatly reduced.
On Oct 15, 11:37 am, Helmut Tessarek <tessa...@evermeet.cx> wrote:
> > I agree with the above statement. It gives DB2 a bad reputation having > > to explain this to developers over and over again. One can try the > > vargraphic data type, but that will hugely increase the size of your > > database.
> Ok, now I'm a little bit confused. Using a new datatype which masks the byte > size (in favor of all possible characters) or using VARGRAPHIC - what is the > difference?
Admittedly I did not even reflect over using VARGRAPHIC. I assume there are some things one should consider before choosing it. Does anyone have a link to the docs where varchar and vargraphic are compared, what restrictions that apply in each case, possible gotchas, etc?
> I would also be very happy with basic functionality like this:
> - the new varchar type reserves 4 times the space specified > - the length is checked on the specified size (duh...) > - all scalar functions (LENGTH, etc...) automatically apply the > CODEUNITS32 keyword on this data type
I'm on vacation right now, but when I'm back at work, I'll start a discussion with the different component owners (although I can't promise that there'll be a positive outcome :-))
What do you mean by 'the length is checked on the specified size'? Can you give me some examples?
Cheers, Helmut
-- Helmut K. C. Tessarek DB2 Performance and Development
/* Thou shalt not follow the NULL pointer for chaos and madness await thee at its end. */
> > I would also be very happy with basic functionality like this:
> > - the new varchar type reserves 4 times the space specified > > - the length is checked on the specified size (duh...) > > - all scalar functions (LENGTH, etc...) automatically apply the > > CODEUNITS32 keyword on this data type
> I'm on vacation right now, but when I'm back at work, I'll start a discussion > with the different component owners (although I can't promise that there'll be > a positive outcome :-))
> What do you mean by 'the length is checked on the specified size'? Can you > give me some examples?
> Cheers, > Helmut
> -- > Helmut K. C. Tessarek > DB2 Performance and Development
> /* > Thou shalt not follow the NULL pointer for chaos and madness > await thee at its end. > */
Helmut,
First of all, enjoy your holiday.
About that sentence, it would probably have been clearer if I just left it out. I'm pretty sure you already got that part.
All I meant to say is that the content of the cell shouldn't be allowed to be longer than whatever you sprcified at creation time, even though there is extra space reserved.
Thanks for taking this serious, I really appreciate that there is a such a channel to people from IBM possible here. Perhaps we can organise a popularity poll? Other people might think this is a bad idea.