Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

reading utf8 from mysql

3 views
Skip to first unread message

Brigitte Jellinek

unread,
Jun 17, 2003, 5:57:25 AM6/17/03
to dbi-...@perl.org

hi!

I'm not sure this is supposed to work yet.... I write utf-8 into mysql -
that seems to work. but when I read it back, not only do I get
a "classical perl string" instead of an utf8 string, but also some
of the bits are broken:

he string i write into the db is 6 characters long:
"A\N{greek:beta}cd\N{cyrillic:e}f"

character unicode utf8
hex binary

A 0041 01000001
? 03B2 1100111010110010
c 0063 01100011
d 0064 01100100
? 044D 1101000110001101
f 0066 01100110

what i get back from the db is

A 01000001
? 11001110
? 10110010
c 01100011
d 01100100
? 11010001
? 00111111
f 01100110


look at the penultimate byte! Is that supposed to happen?


PS.
I compiled everything myself, and used the following versions:
perl(5.8) + dbi(1.37) + dbd::mysql(2.1026) + mysql(4.1.0-alpha)

Here's the code I use:

$dbh = DBI->connect("DBI:mysql:test", "someuser", "somepass");

$uni = "A\N{greek:beta}cd\N{cyrillic:e}f";

$dbh->do("INSERT INTO unitest VALUES (NULL,?,?)", undef, $uni, $uni)
or print $dbh->errstr;

$id = $dbh->{insertid};

$sth = $dbh->prepare("SELECT * FROM unitest WHERE ID = ?");
$sth->execute( $id );
@row = $sth->fetchrow_array();


The table is a utf-8 table:

CREATE TABLE unitest (
ID mediumint(9) NOT NULL auto_increment,
vc varchar(80) default NULL,
t text,
PRIMARY KEY (ID)
) TYPE=MyISAM CHARSET=utf8;

PPS.: I keep a journal about my adventures with perl+mysql+unicode
at http://perlwelt.horus.at/Beispiele/Magic/PerlUnicodeMysql/

--
Brigitte 'I never met a chocolate I didnt like' Jellinek
bje...@horus.at http://www.horus.at/~bjelli
http://perlwelt.horus.at http://www.perlmonks.org/index.pl?node=bjelli

Tim Bunce

unread,
Jun 17, 2003, 6:36:13 AM6/17/03
to Brigitte Jellinek, dbi-...@perl.org
Please try using another tool (eg mysqldump) to read the data from
the db to check if the problem is related to inserting the data
and/or retrieving it. Also try not using placeholders.

Thanks.

Tim.

Brigitte Jellinek

unread,
Jun 17, 2003, 11:12:06 AM6/17/03
to dbi-...@perl.org
> On Tue, Jun 17, 2003 at 11:57:25AM +0200, Brigitte Jellinek wrote:
> > [...] I write utf-8 into mysql -

> > that seems to work. but when I read it back, not only do I get
> > a "classical perl string" instead of an utf8 string, but also some
> > of the bits are broken:

On Tue, Jun 17, 2003 at 11:36:13AM +0100, Tim Bunce answered:


> Please try using another tool (eg mysqldump) to read the data from
> the db to check if the problem is related to inserting the data
> and/or retrieving it. Also try not using placeholders.

ok, i tried both: mysqldump confirms: the problem is not
in inserting, but in reading from the db.

To investigate this further, I wrote one-character strings into the db:

foreach (0x21 ... 0x7e, 0xa1 ... 0xff )
{
$char = pack "U", $_;
$dbh->do("INSERT INTO unitest VALUES (NULL,'$char','$char')")

or print $dbh->errstr;
$id = $dbh->{insertid};
$sth = $dbh->prepare("SELECT * FROM unitest WHERE ID = ?");
$sth->execute( $id );
@row = $sth->fetchrow_array();

# ....
}

I find that in the range 0xc0 to 0xdf the last three bits get set to
001:

write 11000011100 = \x{00C0}
read 11000011001
....
write 11000011100 = \x{00DF}
read 11000011001

The same holds for range 0x010d to 0x011f, range 0x0140 to 0x015f, range
0x1213 to 0x181f, range 0x1840 to 0x185f, and so on.

ps.
i use perl(5.8) + dbi(1.37) + dbd::mysql(2.1026) + mysql(4.1.0-alpha)

pps.
my table is


CREATE TABLE unitest (
ID mediumint(9) NOT NULL auto_increment,
vc varchar(80) default NULL,
t text,
PRIMARY KEY (ID)
) TYPE=MyISAM CHARSET=utf8;

Brigitte Jellinek

unread,
Jun 17, 2003, 11:54:55 AM6/17/03
to dbi-...@perl.org
On Tue, Jun 17, 2003 at 05:12:06PM +0200, Brigitte Jellinek wrote:
> > On Tue, Jun 17, 2003 at 11:57:25AM +0200, Brigitte Jellinek wrote:
> > > [...] I write utf-8 into mysql -
> > > that seems to work. but when I read it back, not only do I get
> > > a "classical perl string" instead of an utf8 string, but also some
> > > of the bits are broken:
>
> On Tue, Jun 17, 2003 at 11:36:13AM +0100, Tim Bunce answered:
> > Please try using another tool (eg mysqldump) to read the data from
> > the db to check if the problem is related to inserting the data
> > and/or retrieving it. Also try not using placeholders.
>
> ok, i tried both: mysqldump confirms: the problem is not
> in inserting, but in reading from the db.

wrong! actually, the problem is in the inserting.

I'm now inserting it two ways:

1) from perl directly via DBI
2) perl -> file with sql -> mysql client

the file with the sql statements is still ok. if I look at the
data with the mysql-commandline client, version 1 and version 2 look the
same:

+------+------+------+-------+
| ID | code | vc | count |
+------+------+------+-------+
| 3058 | 010D | Ä? | 3 |
| 3167 | 010D | Ä? | 3 |
+------+------+------+-------+

when I read it out using perl+dbi, i get for example
for the unicode-character \x{010D}, which should be c with a reverse ^ on top,
the bits 1100010000111111
instead of 1100010010001101

ps.
i still use perl(5.8) + dbi(1.37) + dbd::mysql(2.1026) + mysql(4.1.0-alpha)

pps.
my table for this example is
CREATE TABLE chartab (


ID mediumint(9) NOT NULL auto_increment,

code varchar(10) default NULL,
vc varchar(10) default NULL,
count int(8),


PRIMARY KEY (ID)
) TYPE=MyISAM CHARSET=utf8;

--
Dipl.Ing. Brigitte Jellinek Horus GmbH
http://www.horus.at/team/bjelli.html Internet Beratung + Schulung
bje...@horus.at Web Design + Software Engineering

Brigitte Jellinek

unread,
Jun 18, 2003, 8:31:37 AM6/18/03
to dbi-...@perl.org

Yesterday I wrote about problems with reading utf8,
later I realized that the problems were actually in
the inserting, not the reading. Today I found a workaround.

The problem is caused by mysql, not by perl.
Maybe I'm doing something really stupit, but it seems
that my mysql-server doesn't accept utf8 as utf8. In the
most basic version, using a utf-8 file with sql commands:

mysql --default-character-set=utf8 test < test-insert.sql

my mysql server still interprets the utf8 strings in the file
as latin1 (and mangles all the non-latin1-charachters).

I tried inserting

SET CHARACTER SET utf8;

into the file, but that didn't help.

I had to change all occurences of literal utf8 strings in my SQL
statements, like so:

INSERT INTO sometable VALUES (....,CONVERT(_utf8'blabla' USING utf8), ...);

where 'blabla' is a string that might contain utf8.
the same for SELECT:

SELECT * FROM sometabel
WHERE somecolumn LIKE CONVERT(_utf8'%blabla%' USING utf8);

I now use this workaround through DBI. the data I get back is
not flagged as utf8, but it is correct utf8, so I can decode it
without a problem:

# to search for books with a title containging $char:
my $sql = qq{
SELECT * FROM buch WHERE Originaltitel LIKE
CONVERT(_utf8 '%$char%' USING utf8)
};
my $sth = $dbh->prepare($sql);
$sth->execute();

my (@r, $i);
while( @r = $sth->fetchrow_array() ) {
$_ = decode("utf8", $_) foreach @r;
print "<p>@r</p>";
}

It works. I'm glad. Thanks for the help.

Brigitte


p.s.
you can find the details at
http://perlwelt.horus.at/Beispiele/Magic/PerlUnicodeMysql/

--
Brigitte 'I never met a chocolate I didnt like' Jellinek

bje...@horus.com http://www.horus.com/~bjelli/
http://perlwelt.horus.at http://www.perlmonks.org/index.pl?node=bjelli

Hardy Merrill

unread,
Jun 18, 2003, 8:42:52 AM6/18/03
to Brigitte Jellinek, dbi-...@perl.org
I've been following this thread with interest - really
curiosity. Your workaround seems like a lot of work -
have you tried posting your question on a mysql mailing
list?

Paul Dubois has written some very good books on MySQL
with sections on DBI use - I think Paul is on this list -
I'm hoping Paul will respond with what he knows about
utf8 use in MySQL.

--
Hardy Merrill
Red Hat, Inc.

Paul Dubois

unread,
Jun 18, 2003, 8:56:42 AM6/18/03
to Hardy Merrill, Brigitte Jellinek, dbi-...@perl.org
At 8:42 -0400 6/18/03, Hardy Merrill wrote:
>I've been following this thread with interest - really
>curiosity. Your workaround seems like a lot of work -
>have you tried posting your question on a mysql mailing
>list?
>
>Paul Dubois has written some very good books on MySQL
>with sections on DBI use - I think Paul is on this list -
>I'm hoping Paul will respond with what he knows about
>utf8 use in MySQL.

My suggestion is that if you're using 4.1.0, wait for 4.1.1 and
try it again. There are lots of character-set-handling related
changes, and the problem you're seeing may well be fixed. I know
that some of the new work involves transmission of information
between client and server.

If you *are* using 4.1.1 from the development tree (and thus the
problem remains unsolved), then please report your findings to
bugs.mysql.com so that the problem can be addressed. Thanks!

0 new messages