Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

What to do with UTF-8 data?

8 views
Skip to first unread message

Steve Hay

unread,
Sep 9, 2003, 12:43:21 PM9/9/03
to dbi-...@perl.org
Hi,

Having looked through the list archives a bit, I see there has been lots
of discussion over what to do with UTF-8 data, but I can't find anything
that really helps me out now.

I'm running Perl 5.8.0 with MySQL 3.23.56 (via DBD-mysql). Since (I
think) there is no native UTF-8 support in MySQL below 4.1.x, my plan
was to simply store the bytes of each UTF-8 character in the database
(either by explicitly converting Perl's UTF-8 strings to sequences of
octets using Encode::encode_utf8(), or else just letting that happen of
its own accord as it seems to do :-s), and then convert such octet
sequences back to UTF-8 strings when retrieving data from the database
using Encode::decode_utf8().

As long as I store *all* my data in this way, those conversions should
always succeed. (encode_utf8() never fails anyway, and decode_utf8()
will always work here because I'm always feeding it "valid" data.)

The problem is: How do I trap all input/output to/from DBI to do these
conversions?

I can easily do it "manually":

$dbh->do('INSERT INTO foo (bar) VALUES (?)', undef,
Encode::encode_utf8($input_utf8str));
...
my @octets_row = $dbh->selectrow_array('SELECT bar FROM foo');
my $output_utf8str2 = Encode::decode_utf8($octets_row[0]);

but that's way too tedious in practice. I want to have those
conversions done for me automatically, perhaps by having specified some
appropriate encode/decode callbacks for certain hooks. Do any such
hooks exist?

If, as I suspect, there are not, then what is the currently recommended
solution here?

Would it be at all easy to add such hooks, or some other means of
specifying that I want these conversions done, to DBI as a stop-gap
solution?

Thanks,
- Steve

Chuck Fox

unread,
Sep 9, 2003, 1:24:36 PM9/9/03
to stev...@uk.radan.com, dbi-...@perl.org
Steve,

I am a Sybase DBA, but in situations like this, I have declared the
column on the table to be varbinary or binary and stored the data
directly without conversion. Don't know if MySql supports this datatype.

HTH,

Chuck Fox
Principal Database Adminstrator
America Online, INC.

Steve Hay

unread,
Sep 10, 2003, 3:33:03 AM9/10/03
to Chuck Fox, dbi-...@perl.org
Chuck Fox wrote:

> Steve,
>
> I am a Sybase DBA, but in situations like this, I have declared the
> column on the table to be varbinary or binary and stored the data
> directly without conversion. Don't know if MySql supports this datatype.

One can declare a column "VARCHAR(n) BINARY" for a similar thing to
Sybase's "VARBINARY(n)", but unfortunately it makes no difference. If I
store the data directly without conversion into such a column then, as
far as I can make out, each character's bytes get stored exactly as if I
had converted to bytes beforehand. And when I retrieve the data (again
without conversion) I just get octet sequences into my Perl scalars -
not flagged, UTF-8 character strings as I would like.

This is all exactly the same as what I get with a "VARCHAR(n)" column.

- Steve

Peter J. Holzer

unread,
Sep 10, 2003, 4:14:19 AM9/10/03
to dbi-...@perl.org
On 2003-09-10 08:33:03 +0100, Steve Hay wrote:

Yes, but the difference is that with a binary column the database knows
that the data is not character data in a charset it knows. For example
the MySQL manual states:

Values in CHAR and VARCHAR columns are sorted and compared in
case-insensitive fashion, unless the BINARY attribute was specified
when the table was created.

This doesn't work with UTF-8 data of course, because the individual
bytes of a multi-byte-character are not whole characters and hence don't
have a uppercase or lowercase equivalent. So two strings which should
compare equal generally won't, and sometimes strings which should not
compare equal will. If you just tell the database "this is binary data,
not character data" it won't try to do case conversions on it and it
(hopefully) will stop you doing them in SQL code (You can and must do
them in perl).

hp

Disclaimer: I haven't actually used MySQL for some time, this is general
advice, not targeted at the specific way MySQL compares strings.

--
_ | Peter J. Holzer | Unser Universum wäre betrüblich
|_|_) | Sysadmin WSR / LUGA | unbedeutend, hätte es nicht jeder
| | | h...@wsr.ac.at | Generation neue Probleme bereit.
__/ | http://www.hjp.at/ | -- Seneca, naturales quaestiones

Bart Lateur

unread,
Sep 10, 2003, 5:31:40 AM9/10/03
to dbi-...@perl.org
On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote:

>And when I retrieve the data (again
>without conversion) I just get octet sequences into my Perl scalars -
>not flagged, UTF-8 character strings as I would like.

If you're *sure* that this is UTF-8, only perl doesn't flag it as such,
you can set the flag yourself. In perl 5.8.x, you can use the Encode
module, one of the functions documented near the bottom, here the
_utf8_on(STRING). That's an inplace modifying function, so you use it
like

_utf8_on($string_that_should_be_utf8);

For perl 5.6.x (and likely for 5.8 too), you can achieve the same effect
by using pack() this way:

$flagged_as_utf8 = pack "U0a*", $string_that_should_be_utf8;


Earlier perls than 5.6 don't have this UTF8 flag, nor do they accept the
"U" template.

--
Bart.

Steve Hay

unread,
Sep 10, 2003, 5:40:29 AM9/10/03
to Bart Lateur, dbi-...@perl.org
Bart Lateur wrote:

>On Wed, 10 Sep 2003 08:33:03 +0100, Steve Hay wrote:
>
>
>
>>And when I retrieve the data (again
>>without conversion) I just get octet sequences into my Perl scalars -
>>not flagged, UTF-8 character strings as I would like.
>>
>>
>
>If you're *sure* that this is UTF-8, only perl doesn't flag it as such,
>you can set the flag yourself. In perl 5.8.x, you can use the Encode
>module, one of the functions documented near the bottom, here the
>_utf8_on(STRING). That's an inplace modifying function, so you use it
>like
>
> _utf8_on($string_that_should_be_utf8);
>

Yep, that's an alternative to the "$now_is_utf8 =
Encode::decode_utf8($should_be_utf8)" call that I described in my
original posting.

But the question was: How can I arrange for such conversions to be
performed automatically by DBI whenever it receives or returns data?

- Steve

Bart Lateur

unread,
Sep 10, 2003, 6:14:25 AM9/10/03
to dbi-...@perl.org
On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:

>But the question was: How can I arrange for such conversions to be
>performed automatically by DBI whenever it receives or returns data?

Well, there are two options... either does the dtabase somewhere stores
the flag indicating that some string is in UTF8, or you have to add that
information yourself. For the latter, I don't know if it'll actually
work, but it seems like an appropriate way to do it: add a "BOM" marker
at the start of the string.

<http://www.unicode.org/unicode/faq/utf_bom.html#22> (and below)

--
Bart.

Steve Hay

unread,
Sep 10, 2003, 6:42:23 AM9/10/03
to Bart Lateur, dbi-...@perl.org
Bart Lateur wrote:

>On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:
>
>
>
>>But the question was: How can I arrange for such conversions to be
>>performed automatically by DBI whenever it receives or returns data?
>>
>>
>
>Well, there are two options... either does the dtabase somewhere stores
>the flag indicating that some string is in UTF8, or you have to add that
>information yourself. For the latter, I don't know if it'll actually
>work, but it seems like an appropriate way to do it: add a "BOM" marker
>at the start of the string.
>

I don't think the MySQL 3.x stores any flag to indicate that a string is
UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql
that would make use of it, e.g. to decode data flagged in such a way
into Perl's internal format.

Adding a BOM myself to the string seems to have problems of its own (see
http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not
aware of DBI / DBD-mysql having anything in them that would make use of
such a BOM. Please correct me if I'm wrong - that could be just the
sort of thing that I'm looking for here.

- Steve


Peter J. Holzer

unread,
Sep 10, 2003, 6:34:12 AM9/10/03
to dbi-...@perl.org
On 2003-09-10 10:40:29 +0100, Steve Hay wrote:
> But the question was: How can I arrange for such conversions to be
> performed automatically by DBI whenever it receives or returns data?

You could subclass DBI or DBD::MySQL and replace all methods with
wrappers which perform the conversion. I'm not convinced that this is a
good idea, though. I'd rather try to the the conversion in application
specific layer above DBI.

hp

Peter J. Holzer

unread,
Sep 10, 2003, 6:46:36 AM9/10/03
to dbi-...@perl.org
On 2003-09-10 12:14:25 +0200, Bart Lateur wrote:
> On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:
>
> >But the question was: How can I arrange for such conversions to be
> >performed automatically by DBI whenever it receives or returns data?
>
> Well, there are two options... either does the dtabase somewhere stores
> the flag indicating that some string is in UTF8, or you have to add that
> information yourself. For the latter, I don't know if it'll actually
> work, but it seems like an appropriate way to do it: add a "BOM" marker
> at the start of the string.

That doesn't help Steve. He already knows that the data is UTF-8, he
doesn't need the marker to distinguish between UTF-8 and Latin-X.

His problem is that when he selects from the database, he has to
manually convert from utf-8 to "perl-internal":

while (my ($foo, $bar) = $sth->fetchrow_array()) {
$foo = decode_utf8($foo);
$bar = decode_utf8($bar);

# do something with foo and bar
}

and he wants to happen the decode step automatically.

Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL
3.23 to 4.1?

Tim Bunce

unread,
Sep 10, 2003, 10:25:03 AM9/10/03
to Steve Hay, Bart Lateur, dbi-...@perl.org

Basically it should be the job of the drivers to set the uft8 flag on
data being retrieved if it is utf8. I believe that the new mysql v4.1
protocol does provide information about the characterset of each colum.
DBD::mysql can use that.

For people stuck with older versions of mysql, a driver private
option could be used to indicate that all char fields are utf8,
or have some way of indicating that per-column, such as

$sth->bind_col(1, undef, { mysql_charset => 'utf8' });

Tim.

Steve Hay

unread,
Sep 10, 2003, 10:37:30 AM9/10/03
to Peter J. Holzer, dbi-...@perl.org
Peter J. Holzer wrote:

>Since MySQL 4.1 does support UTF-8: Is it possible to upgrade from MySQL
>3.23 to 4.1?
>

I might be able to upgrade to 4.0 (I fact, I really ought to...), but I
don't fancy 4.1 just yet -- it's still an alpha release :-(

- Steve

Steve Hay

unread,
Sep 10, 2003, 10:41:42 AM9/10/03
to Peter J. Holzer, dbi-...@perl.org
Peter J. Holzer wrote:

>On 2003-09-10 10:40:29 +0100, Steve Hay wrote:
>
>
>>But the question was: How can I arrange for such conversions to be
>>performed automatically by DBI whenever it receives or returns data?
>>
>>
>
>You could subclass DBI or DBD::MySQL and replace all methods with
>wrappers which perform the conversion. I'm not convinced that this is a
>good idea, though. I'd rather try to the the conversion in application
>specific layer above DBI.
>

Sub-classing DBI wouldn't help me since I'm using Class::DBI which won't
know to use my DBI sub-class.

Sub-classing DBD-mysql might be more feasible, though, since I specify
what driver to use.

Actually, I had hoped to find some appropriate hooks in Class::DBI, but
I don't see any. There are "select" and "before_set" triggers, but the
latter are per-column, which would be a real pain to set up.

- Steve

Steve Hay

unread,
Sep 10, 2003, 10:46:20 AM9/10/03
to Tim Bunce, Bart Lateur, dbi-...@perl.org
Tim Bunce wrote:

>On Wed, Sep 10, 2003 at 11:42:23AM +0100, Steve Hay wrote:
>
>
>>Bart Lateur wrote:
>>
>>
>>
>>>On Wed, 10 Sep 2003 10:40:29 +0100, Steve Hay wrote:
>>>
>>>
>>>
>>>>But the question was: How can I arrange for such conversions to be
>>>>performed automatically by DBI whenever it receives or returns data?
>>>>
>>>>
>>>Well, there are two options... either does the dtabase somewhere stores
>>>the flag indicating that some string is in UTF8, or you have to add that
>>>information yourself. For the latter, I don't know if it'll actually
>>>work, but it seems like an appropriate way to do it: add a "BOM" marker
>>>at the start of the string.
>>>
>>>
>>I don't think the MySQL 3.x stores any flag to indicate that a string is
>>UTF8, and even if it did I'm not aware of anything in DBI or DBD-mysql
>>that would make use of it, e.g. to decode data flagged in such a way
>>into Perl's internal format.
>>
>>Adding a BOM myself to the string seems to have problems of its own (see
>>http://www.unicode.org/unicode/faq/utf_bom.html#27), and again I'm not
>>aware of DBI / DBD-mysql having anything in them that would make use of
>>such a BOM. Please correct me if I'm wrong - that could be just the
>>sort of thing that I'm looking for here.
>>
>>
>
>Basically it should be the job of the drivers to set the uft8 flag on
>data being retrieved if it is utf8. I believe that the new mysql v4.1
>protocol does provide information about the characterset of each colum.
>DBD::mysql can use that.
>

Ah. In that case, I should get onto the DBD-mysql people to look for
assistance. I was thinking that DBI itself would be adding some kind of
UTF-8 support.

>
>For people stuck with older versions of mysql, a driver private
>option could be used to indicate that all char fields are utf8,
>or have some way of indicating that per-column, such as
>
> $sth->bind_col(1, undef, { mysql_charset => 'utf8' });
>

OK, I'll pass this suggestion on to the DBD-mysql maintainer(s).

Thanks,
- Steve

Steve Hay

unread,
Sep 11, 2003, 7:31:52 AM9/11/03
to Tim Bunce, Jochen Wiedmann, pe...@lists.mysql.com, dbi-...@perl.org, dbi...@perl.org
Tim Bunce wrote:

>On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:
>
>
>>Hi, Steve,


>>
>>
>>
>>>The problem is: How do I trap all input/output to/from DBI to do these
>>>conversions?
>>>
>>>

>>>I've asked about this on the dbi-users mailing list, and the answer
>>>(from Tim Bunce, no less) was that it is really the responsibility of
>>>the DBD driver to perform such conversions if the data in question is UTF-8.
>>>
>>>
>
>That's not quite right. I wasn't talking about any _conversions_ at all.
>
I'm sorry if I mis-quoted you. I meant setting the UTF-8 flag on an
octet sequence that can be interpreted as UTF-8, rather than leaving it
unflagged and treated as Latin-1. Thus, the data is in some sense
"converted" from Latin-1 to UTF-8.

>
>
>
>>after letting my thoughts settle I come to the conclusion that I do not
>>agree completely. I think that DBI should do 80% of the job and leave
>>about 20% to the driver authors.
>>
>>
>
>For a "full solution" yes, I agree - and I've written about this in the past.
>
>For now I'm just talking about the specific but fairly common
>situation of fetching data that is utf8 encoded but it doesn't
>get flagged as such by the driver.
>
>For that case the driver just needs to know when to do a SvUTF8_on(sv).
>
Exactly.

What about data going _into_ the database? In my examples of doing the
conversion manually with Encode::{en|de}code_utf8(), I was converting
the Perl strings to octet sequences that could later be interpreted as
UTF-8 before insertion into the database. That way I could guarantee
that all data retrieved from the database can be converted to UTF-8, in
fact (as you pointed out) by simply turning the UTF-8 flag on.

If all the data that I insert really is UTF-8 then I guess it will just
get "serialised" as a sequence of octets, and everything will be OK.

But what if the data I'm inserting isn't all UTF-8? The problem is:

1. Perl's internal format isn't just UTF-8 -- it defaults to Latin-1 (or
whatever) for strings in which every character can be represented in
Latin-1;
2. The "8-bit" characters of Latin-1 are represented as two-byte
characters in UTF-8.

So, if I have the string "Copyright © Fred Bloggs" in Perl then it will
not be UTF-8: the © is stored as one byte, not two not, and the UTF-8
flag is off. If I insert that straight into the database without
running it through Encode::encode_utf8() first, then © itself, rather
than its two-byte UTF-8 representation gets stored in the database, so
when it gets retrieved from the database later you can't just turn the
UTF-8 flag on -- you would need to run it through Encode::decode_utf8().

In other words, just having the driver switching the UTF-8 on and off
will only work if I guarantee that all the strings I feed it to start
with really are UTF-8, even when Perl would not normally have
represented them as such.

It would be cool if something akin to "binmode STDOUT, ':utf8';" could
be applied when sending data to the driver -- i.e. my data is in "Perl's
internal format", whether that be Latin-1 or UTF-8 in the case of the
string at hand, and it all gets automagically upgraded to UTF-8 if
necessary before insertion into the database. Then you only need to
turn the flag on when retrieving it again.

At least, I think that's what I want :-s

- Steve

Bart Lateur

unread,
Sep 11, 2003, 10:25:56 AM9/11/03
to pe...@lists.mysql.com, dbi-...@perl.org, dbi...@perl.org
On Thu, 11 Sep 2003 12:31:52 +0100, Steve Hay wrote:

>It would be cool if something akin to "binmode STDOUT, ':utf8';" could
>be applied when sending data to the driver -- i.e. my data is in "Perl's
>internal format", whether that be Latin-1 or UTF-8 in the case of the
>string at hand, and it all gets automagically upgraded to UTF-8 if
>necessary before insertion into the database.

Oh that's easy to achieve. Just concatenate the string with an UTF-8
string, and you'll get an UTF-8 string. Perl will do the upgrading for
you.

Just try it:

$zero_length_utf8 = pack "U0"; # UTF8, length ==0
$string = "élève"; # Latin-1
$string .= $zero_length_utf8; # upgrade to UTF8
print $string;

Now the reverse is much harder... :)

--
Bart.

0 new messages