DB_File has problems with UTF8?

Alberto Manuel Brandão Simões

Dec 3, 2003, 3:51:24 PM

to perl5-...@perl.org

Hi.
I was just playing with Gtk2 binding for perl with a direct tie to a DB
(well, was MLDBM) file and found some problems.

I am attaching a perl script in UTF8 which does not work, and I think it
should. Please, if it is my fault, explain how to solve it.

Best regards,
Alberto

teste.pl

Paul Marquess

Dec 4, 2003, 8:45:14 AM

to Alberto Manuel Brandão Simões, perl5-...@perl.org

Alberto,

in what way does it not work? What did you expect it to do and what did it
actually do?

cheers
Paul

Alberto Manuel Brandão Simões

Dec 4, 2003, 3:21:22 PM

to Perl 5 porters

Basically, I've set a key and data in utf8 in the db file and, when I
retrieve them, perl thinks they are normal strings (not in utf8).
So, if my key is 'cão' (dog in Portuguese), DB will return cÃ£o as 4
different characters...

I've tried to use Encode to encode the strings to iso-8859-1 before
printing them (I thought it could be a print/terminal problem) but it
prints exactly the same.

Cheers,
Alb
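
The symptom described above can be reproduced without any DBM file. The following is only a minimal sketch, assuming nothing beyond the core Encode module; the variable names are illustrative and not taken from the attached teste.pl. It shows that the UTF-8 encoding of 'cão' is four octets, which perl treats as four Latin-1 characters unless they are decoded again.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# "cão", written with an escape so the example does not depend on the
# encoding of this source file.
my $key = "c\x{e3}o";                  # 3 characters

# What a byte-oriented store ends up holding: the UTF-8 encoding,
# 4 octets (63 C3 A3 6F).
my $stored = encode_utf8($key);

# Read back without decoding, perl sees 4 Latin-1 characters: c, Ã, £, o.
my $raw_len = length $stored;

# Decoding restores the original 3-character string.
my $decoded = decode_utf8($stored);

printf "characters: %d, octets read back: %d, round trip ok: %s\n",
    length($key), $raw_len, ($decoded eq $key ? "yes" : "no");
```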

Paul Marquess

Dec 5, 2003, 4:54:58 AM

to Alberto Manuel Brandão Simões, Perl 5 porters

If I remove the "use utf8" from your script, it seems to work.

Here is a slight revamp of your script that I used.

Paul

#!/usr/bin/perl

#use utf8;
#use locale;
use DB_File;
use Fcntl;

use strict;
use warnings;

my %db;
tie %db, 'DB_File', 'foo.db', O_RDWR|O_CREAT, 0666, $DB_BTREE;

while (my ($k, $v) = each %db) {
    print "$k -> $v\n";
    print "key ok\n" if $k eq "cão";
    print "val ok\n" if $v eq "Animal que não é o gato";
}

$db{"cão"} = "Animal que não é o gato";

print "ok\n" if $db{"cão"} eq "Animal que não é o gato";

cheers
Paul

Alberto Manuel Brandão Simões

Dec 5, 2003, 9:55:01 AM

to Paul.M...@btinternet.com, Perl 5 porters

On Fri, 2003-12-05 at 09:54, Paul Marquess wrote:
> If I removed the "use utf8" from your script, it seems to work.
>

But with the utf8 pragma removed, perl will interpret those strings as
sequences of bytes, not Unicode characters.

From the utf8 manpage:
    Until UTF-8 becomes the default format for source text,
    either this pragma or the "encoding" pragma should be used
    to recognize UTF-8 in the source. When UTF-8 becomes the
    standard source format, this pragma will effectively
    become a no-op.

So, I think the problem is still there. You didn't solve it...
you simply removed the use of real utf8 (well, at least this is how I
see the problem).

Paul Marquess

Dec 5, 2003, 11:36:47 AM

to al...@alfarrabio.di.uminho.pt, Dan Kogai, Perl 5 porters

Time to call in the expert. Dan, can you comment please?

The last time I remember something like this coming up, the line was that XS modules that read/write Perl strings to a file didn't have to do anything different. Is that true?

Paul

Roman Vasicek

Dec 5, 2003, 11:04:34 AM

to perl5-...@perl.org

Don't DBM filters solve your problem? Use

filter_fetch_key(sub {$_=decode_utf8($_)});
filter_store_key(sub {$_=encode_utf8($_)});

and the same for the value. Or filter only when fetching from the database
if storing is already OK.

Roman Vasicek


Nick Ing-Simmons

Dec 7, 2003, 5:56:23 PM

to Paul.M...@btinternet.com, Dan Kogai, al...@alfarrabio.di.uminho.pt, Perl 5 porters

Paul Marquess <Paul.M...@btinternet.com> writes:
>> From the utf8 manpage:
>> Until UTF-8 becomes the default format for source text, either this
>> pragma or the "encoding" pragma should be used to recognize UTF-8 in
>> the source. When UTF-8 becomes the standard source format, this
>> pragma will effectively become a no-op.
>>
>> So, I think there is still that problem. You didn't solve the problem... you
>> simply removed the use of real utf8 (well, at least is this how I see the
>> problem).
>
>Time to call in the expert. Dan, can you comment please?
>
>Last time I remember something like this come up, the line was that XS modules
>that read/wrote Perl string to file didn't have to do anything different. Is
>that true?

No.

An SV can be in one of two states, as shown by SvUTF8(sv).
If that is true then SvPV is in UTF-8; if it is false then SvPV is
octets, assumed to be in iso-8859-1 in some legacy sense.

So if XS code just reads/writes SvPV it should do one of:

A. if (SvUTF8(sv)) downgrade to bytes and croak if it can't,
   i.e. external file is octets, user should "encode" data
   before write and "decode" after read.
   e.g. call SvPVbyte()

B. if (!SvUTF8(sv)) upgrade to UTF-8 (this should not fail),
   i.e. file is in "characters" held as UTF-8.
   e.g. call SvPVutf8()

C. Store the state of the flag with the data and re-set it on read.

My guess is DB_File should do (A) for the data part; it is less clear what
to do about the key part.
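
Options A and B have rough Perl-level counterparts in utf8::downgrade() and utf8::upgrade(), which makes the two states easy to observe without writing XS. This is only a sketch under that assumption: the XS macros SvPVbyte() and SvPVutf8() are what DB_File.xs would actually call, and the variable names here are illustrative.

```perl
use strict;
use warnings;

# A string in the Latin-1 range, held internally as plain octets.
my $s = "c\x{e3}o";
my $flag_before = utf8::is_utf8($s) ? 1 : 0;    # 0: SvUTF8 is off

# Option B: upgrade to UTF-8 (roughly what SvPVutf8() does). Never fails.
my $up = $s;
utf8::upgrade($up);
my $flag_up = utf8::is_utf8($up) ? 1 : 0;       # 1: SvUTF8 is on

# Both scalars are still the same string as far as perl is concerned.
my $same = ($up eq $s) ? 1 : 0;

# Option A: downgrade to bytes (roughly what SvPVbyte() does).
my $down = $up;
utf8::downgrade($down);                         # fine, every char <= 0xFF

# A character above 0xFF cannot be downgraded; utf8::downgrade() dies.
my $wide = "\x{20ac}";                          # EURO SIGN
my $ok   = eval { utf8::downgrade($wide); 1 };
my $err  = $ok ? "" : $@;

print "flag before=$flag_before, after upgrade=$flag_up, equal=$same\n";
print "downgrade of wide char failed: $err" unless $ok;
```

The eval around the last downgrade stands in for the croak that option A asks for when the data has not been encoded first.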

Paul Marquess

Dec 8, 2003, 10:49:35 AM

to Nick Ing-Simmons, Dan Kogai, al...@alfarrabio.di.uminho.pt, Perl 5 porters

From: Nick Ing-Simmons [mailto:ni...@ing-simmons.net]

> Paul Marquess <Paul.M...@btinternet.com> writes:
> >> >From the utf8 manpage:
> >> Until UTF-8 becomes the default format for source text,
> either this
> >> pragma or the "encoding" pragma should be used to
> recognize UTF-8 in
> >> the source. When UTF-8 becomes the standard source format, this
> >> pragma will effectively become a no-op.
> >>
> >> So, I think there is still that problem. You didn't solve the
> problem... you
> >> simply removed the use of real utf8 (well, at least is this
> how I see the
> >> problem).
> >
> >Time to call in the expert. Dan, can you comment please?
> >
> >Last time I remember something like this come up, the line was
> that XS modules
> >that read/wrote Perl string to file didn't have to do anything
> different. Is
> >that true?
>
> No.

:-)

> An SV can be in one of two states as shown by SvUTF8(sv)
> If that is true then SvPV is in UTF-8, if it is false then SvPV is
> octets, assumed to be in iso-8859-1 in some legacy sense.
>
> So if XS code just reads/writes SvPV it should do one of:
> A. if (SvUTF8(sv)) downgrade to bytes and croak if it can't
> i.e. external file is octets, user should "encode" data
> before write and "decode" after read.
> e.g. call SvPVbyte()
>
> B. if (!SvUTF8(sv)) upgrade to UTF-8 (this should not fail).
> i.e. file is in "characters" held as UTF-8.
> e.g. call SvPVutf8()
>
> C. Store the state of flag with data and re-set on read.
>
> My guess is DB_File should do (A) for data part, less clear what
> to do about the key part.

This is coming back to me now. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-04/msg00787.html for a discussion on this very topic. It happened just prior to 5.8.0 being released.

The gist of that thread is that encoding/decoding to/from UTF-8 is an application issue and that DBM Filters was the way to do it.

I assume that the goalposts haven't moved, and that it is still true to say that this is an application issue.

This reminds me - is an enhanced DBM filters module that can handle utf-8 encoding/decoding still wanted in the core? The thread referenced above seemed to imply that it was. I've had a prototype up and running for a while. I just need a push to tidy it up.

Paul

Alberto Manuel Brandão Simões

Dec 8, 2003, 11:03:53 AM

to Paul.M...@btinternet.com, Nick Ing-Simmons, Dan Kogai, Perl 5 porters

OK, that can be a solution, if DB really does not work with native utf8.
Now, I think the DB_File module could check whether the string is utf8 and try
to latin'ize it... if that fails, croak.

Or you could add documentation to the POD saying that this problem exists
and how to solve it. (Yes, that was how I solved my problem, so far.)

Best regards,
Alberto

Paul Marquess

Dec 8, 2003, 10:50:29 AM

to al...@alfarrabio.di.uminho.pt, Roman Vasicek, perl5-...@perl.org

Alberto,

You have utf-8 in both the key and the value, so you need to do this

use DB_File;
use Encode;

my $h = tie %hash, ...

$h->filter_fetch_key ( sub { $_ = Encode::decode_utf8($_) } );
$h->filter_fetch_value( sub { $_ = Encode::decode_utf8($_) } );
$h->filter_store_key ( sub { $_ = Encode::encode_utf8($_) } );
$h->filter_store_value( sub { $_ = Encode::encode_utf8($_) } );

Paul
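
The four filters above can be tried end to end with a throwaway database. The sketch below makes one substitution that should be stated plainly: it ties to SDBM_File rather than DB_File, on the assumption that SDBM_File is available wherever perl is and offers the same filter_* hooks, so the example runs without Berkeley DB. With DB_File the tie call would also take $DB_BTREE or $DB_HASH, as in the script earlier in the thread. The file name and variable names are illustrative.

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use Encode qw(encode_utf8 decode_utf8);
use File::Temp qw(tempdir);

# Throwaway directory so the example leaves nothing behind.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/utf8demo";

my %hash;
my $h = tie %hash, 'SDBM_File', $file, O_RDWR|O_CREAT, 0666
    or die "Cannot tie $file: $!";

# Encode on the way in, decode on the way out, for keys and values.
$h->filter_store_key  (sub { $_ = encode_utf8($_) });
$h->filter_store_value(sub { $_ = encode_utf8($_) });
$h->filter_fetch_key  (sub { $_ = decode_utf8($_) });
$h->filter_fetch_value(sub { $_ = decode_utf8($_) });

my $key   = "c\x{e3}o";                           # "cão"
my $value = "Animal que n\x{e3}o \x{e9} o gato";

$hash{$key} = $value;

# Fetch the key and the value back through the filters.
my ($k) = keys %hash;
my $v   = $hash{$key};
my $roundtrip_ok = ($k eq $key && $v eq $value) ? 1 : 0;

print "round trip ", ($roundtrip_ok ? "ok" : "FAILED"), "\n";

undef $h;        # drop the extra reference before untie
untie %hash;
```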

sla...@rezic.de

Dec 8, 2003, 11:32:24 AM

to Alberto Manuel Brandão Simões, dank...@dan.co.jp, Paul.M...@btinternet.com, ni...@ing-simmons.net, perl5-...@perl.org

You could check for valid utf8, but then there's the possibility of getting
false positives. I don't see a way to do the conversion automatically
without breaking older db files.

> Or you can add Documentation in the POD to say that this problem exists
> and how to solve it (Yes, it was that way I solved my problem, so far.)

That's the way, maybe along with some predefined filters (either in
Pod or in .pm) to make the conversion easy.

Regards,
Slaven

--
__o Slaven Rezic
_`\<,_ slaven <at> rezic <dot> de
__(_)/ (_)____
______________________________________________________________________________

Gisle Aas

Dec 8, 2003, 11:45:38 AM

to Alberto Manuel Brandão Simões, Paul.M...@btinternet.com, Nick Ing-Simmons, Dan Kogai, Perl 5 porters

Alberto Manuel Brandão Simões <al...@alfarrabio.di.uminho.pt> writes:

> OK, that can be a solution, if DB really do not work with native utf8.
> Now, I think the DB_File module can check if the string is utf_8 and try
> to latin'ize it... if it fails, croak.

Which is exactly what SvPVbyte() does for you. DB_File.xs should
basically replace its calls to SvPV() with SvPVbyte(), as Nick
suggested a few messages ago.

--Gisle

Paul Marquess

Dec 8, 2003, 12:11:20 PM

to Gisle Aas, Alberto Manuel Brandão Simões, Paul.M...@btinternet.com, Nick Ing-Simmons, Dan Kogai, Perl 5 porters

From: Gisle Aas [mailto:gi...@ActiveState.com]

If that is the preferred method, then why do you need the ":utf8" layer when
writing utf8 to a standard file? [see perluniintro from 5.8.2]

This strikes me as being inconsistent.

Paul

Nick Ing-Simmons

Dec 8, 2003, 12:23:52 PM

to Paul.M...@btinternet.com, Dan Kogai, al...@alfarrabio.di.uminho.pt, Nick Ing-Simmons, Perl 5 porters

Paul Marquess <Paul.M...@btinternet.com> writes:
>> An SV can be in one of two states as shown by SvUTF8(sv) If that is true
>> then SvPV is in UTF-8, if it is false then SvPV is octets, assumed to be in
>> iso-8859-1 in some legacy sense.
>>
>> So if XS code just reads/writes SvPV it should do one of:
>> A. if (SvUTF8(sv)) downgrade to bytes and croak if it can't
>> i.e. external file is octets, user should "encode" data before write
>> and "decode" after read.
>> e.g. call SvPVbyte()
>>

>> My guess is DB_File should do (A) for data part, less clear what to do about
>> the key part.
>
>This is coming back to me now. See
>http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-04/msg00787.html
>for a discussion on this very topic. It happened just prior to
>5.8.0 being released.
>
>The gist of that thread is that encoding/decoding to/from UTF-8 is an
>application issue and that DBM Filters was the way to do it.

Which is my (A).

>
>I assume that the goalposts haven't moved, and is still true to say that this
>is an application issue.

It is, but to be "bomb proof" you should still check that they
have done so - i.e. use SvPVbyte

Gisle Aas

Dec 8, 2003, 12:41:08 PM

to Paul.M...@btinternet.com, Alberto Manuel Brandão Simões, Nick Ing-Simmons, Dan Kogai, Perl 5 porters

"Paul Marquess" <Paul.M...@btinternet.com> writes:

You don't need the :utf8 layer if you are willing to encode() yourself.

You can put a layer-like mechanism on DB_File access as well if you
wish to save the user from explicit encode() or decode() calls. Using
SvPVbyte() just ensures that you don't drop information silently when
marshalling strings to storage that only supports bytes. It does not
prevent adding layers later.

Regards,
Gisle
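
The equivalence between a :utf8 layer and an explicit encode() can be checked with in-memory filehandles. This is a small sketch, assuming only core perl and the Encode module; the variable names are illustrative. Both routes should leave the same octets in the buffer.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "c\x{e3}o \x{20ac}";       # contains a character above 0xFF

# 1. Let the :utf8 layer do the encoding on output.
my $via_layer = '';
open my $out, '>:utf8', \$via_layer or die "open: $!";
print {$out} $text;
close $out;

# 2. No encoding layer; encode explicitly and write the octets.
my $via_encode = '';
open my $raw, '>:raw', \$via_encode or die "open: $!";
print {$raw} encode('UTF-8', $text);
close $raw;

print "same octets: ", ($via_layer eq $via_encode ? "yes" : "no"), "\n";
```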

Nick Ing-Simmons

Dec 8, 2003, 1:01:01 PM

to Paul.M...@btinternet.com, Gisle Aas, Dan Kogai, Nick Ing-Simmons, Alberto Manuel Brandão Simões, Perl 5 porters

If it has :utf8 then they are not 'standard' files ;-)

However, :utf8 and :bytes are in line with all this:

:bytes (the default) causes IO to do the equivalent of SvPVbyte - downgrade
or whinge. (For historical reasons it is only a warn, not a die.)

:utf8 tells IO to do the equivalent of SvPVutf8.

Doing this as a "mode" of the file handle makes sense for normal IO.
Note that non-tied hashes handle this by themselves (eventually, after
some pain and false starts).

Feel free to make it a "mode" of DB_File too.

>
>This strikes me as being inconsistent.

What, perl inconsistent, - how unusual ;-)

Gisle Aas

Dec 8, 2003, 1:11:08 PM

to Nick Ing-Simmons, Paul.M...@btinternet.com, Dan Kogai, Nick Ing-Simmons, Alberto Manuel Brandão Simões, Perl 5 porters

Nick Ing-Simmons <nick.ing...@elixent.com> writes:

> :bytes (the default) causes IO to do equivalent of SvPVbyte - downgrade
> or winge. (For historical reasons it is only a warn not a die.)

The historical reasons only apply to print, not to syswrite. So we
have print warn, but syswrite die.

> What, perl inconsistent, - how unusual ;-)

Regards,
Gisle

Paul Marquess

Dec 10, 2003, 12:25:51 PM

to Nick Ing-Simmons, Gisle Aas, Dan Kogai, Nick Ing-Simmons, Alberto Manuel Brandão Simões, Perl 5 porters

From: Nick Ing-Simmons [mailto:nick.ing...@elixent.com]

Well I do have both a "utf8" and a general "encoding" filter written for my new DBM Filters module.

> >
> >This strikes me as being inconsistent.
>
> What, perl inconsistent, - how unusual ;-)

As long as it isn't inconsistency for its own sake :-)

I'm keen to get this sorted out once and for all, both for DB_File and for all the other *DB*_File modules. At the moment I'm leaning towards providing canned DBM filters to handle the utf8 problem, but I need to find some time to see what interaction SvPVbyte() has, if any, with the existing DBM filters hooks.

I guess the things I need to convince myself of before I decide one way or the other are:

1. Adding SvPVbyte() won't break existing code.
Is anyone relying on the existing behaviour?

2. Does adding SvPVbyte() (without the aid of a DBM filter or someone explicitly calling encode on the returned data) mean that round-trip integrity of the data is maintained, i.e. if I write some utf-8 data to a DBM file, then read it back, will it be identical to the original data?

3. Can it co-exist with the existing DBM Filters hooks?
I think the trick here would be to make sure that the SvPVbyte() call is done after any filters have been invoked. I'll need to dig into the code to see if this is an issue.

4. DB_File has a parallel existence on CPAN.
Whatever is done, DB_File still needs to be able to be built with legacy versions of Perl.

I'll have a play with SvPVbyte() and report back.

Paul

Nicholas Clark

Dec 10, 2003, 3:12:41 PM

to Paul Marquess, Nick Ing-Simmons, Dan Kogai, al...@alfarrabio.di.uminho.pt, Perl 5 porters

On Mon, Dec 08, 2003 at 03:49:35PM -0000, Paul Marquess wrote:

> This reminds me - is an enhanced DBM filters module that can handle utf-8 encoding/decoding still wanted in the core? The thread reference above seemed to imply that it was. I've had a prototype up and running for a while. I just need a push to tidy it up.

I'm only going to integrate what I find in blead :-)
I can see this specific filter being useful to applications.
Would it also serve as a useful example of how to write a filter?

How large (small?) is it?

Nicholas Clark

Nick Ing-Simmons

Dec 10, 2003, 5:28:06 PM

to Paul.M...@btinternet.com, Gisle Aas, Dan Kogai, Nick Ing-Simmons, Alberto Manuel Brandão Simões, Nick Ing-Simmons, Perl 5 porters

Paul Marquess <Paul.M...@btinternet.com> writes:
>From: Nick Ing-Simmons [mailto:nick.ing...@elixent.com]

>
>I'm keen to get this sorted out once and for all, both for DB_File and for
>all the other *DB*_File modules.

Well we can try.

>At the moment I'm leaning towards providing
>canned DBM filters to handle the utf8 problem, but I need to find some time
>to see what interaction SvPVbyte() has, if any, with the existing DBM
>filters hooks.
>
>I guess the things I need to convince myself of before I decide one way or the
>other are:
>
>1. Adding SvPVbyte() won't break existing code. Is anyone relying on the
> existing behaviour?

I don't think it can _break_ existing code. They can't really be relying
on the "use whatever SvPV happens to be" behaviour because it isn't reliable.
Their code may be working by a fluke, then one day someone puts a £ or ¢
somewhere new and phut.

>
>2. Does adding SvPVbyte() (without the aid of a DBM filter or someone
> explicitly calling encode on the returned data) mean that round-trip
> integrity of the data is maintained, i.e. if I write some utf-8 data to a
> DBM file, then read it back, will it be identical to the original data?

Yes, a round trip is maintained. If they try to write chars > U+00FF (255) to
the thing without encoding them, the _write_ will croak with SvPVbyte,
so the trip gets canceled on the outbound leg.

>
>3. Can it co-exist with the existing DBM Filters hooks? I think the trick here
> would be to make sure that the SvPVbyte() call is done after any filters
> have been invoked. I'll need to dig into the code to see if this is an
> issue.

I think the SvPVbyte check should be as close to the "file" and as far from
user perl code as possible (in the "pipe"). That way a hook of some kind
can be used to do the encode.

Note that on the read side the SvPVbyte-oid is to make sure that the
SV you return to perl isn't accidentally marked as UTF8.
This is just SvUTF8_off(sv) unless you have to handle appending your
data to an existing UTF8 SV.

>
>4. DB_File has a parallel existence on CPAN. Whatever is done, DB_File still
> needs to be able to be built with legacy versions of Perl.

SvPVbyte dates back (with slightly evolving semantics) to 5.6, which
introduced UTF-8-ness.

Paul Marquess

Dec 10, 2003, 6:42:09 PM

to Nicholas Clark, Nick Ing-Simmons, Dan Kogai, al...@alfarrabio.di.uminho.pt, Perl 5 porters

From: Nicholas Clark [mailto:ni...@flirble.org]On Behalf Of Nicholas Clark

What I've done is to write a pure-perl module, called DBM_Filter, that makes
use of the existing low-level DBM filter hooks present in each of the
*DB*_File modules. It is about 200 lines worth of code.

It started off life prior to 5.8.0 when there was a discussion about how
best to provide an easy way for an Encoding filter to be applied to a DBM
file. The existing DBM filters were proposed, and although they can do the
job, I thought they were a bit too low level.

That's when I thought of the DBM_Filter module. The main thing it does is
provide a framework to make it easier to write DBM filters and to put common
filters, like encoding and null termination, into modules of their own.
Taking null termination as an example (this was the original reason for
writing the DBM filters in the first place), here is what a user has to do
at present

use DB_File;
my $db = tie %hash, ...

$db->filter_fetch_key ( sub { s/\x00$// } );
$db->filter_fetch_value( sub { s/\x00$// } );
$db->filter_store_key ( sub { $_ .= "\x00" } );
$db->filter_store_value( sub { $_ .= "\x00" } );

With the DBM_Filter module, they do this

use DBM_Filter;
use DB_File ;

my $db = tie %hash, ...

$db->Filter_Push('null');

and here is the complete null filter module:

package DBM_Filter::null ;

sub Store { $_ .= "\x00" ; }

sub Fetch { s/\x00$// ; }

1;

All the hard work is done in the DBM_Filter module. It allows the filters to
be applied to the keys only, the values only, or to both keys and values. It
also allows stacking of filters.

I also have a general-purpose encoding filter, which isn't much bigger than
the null filter above. Its usage is

$db->Filter_Push('encoding' => 'iso-8859-16');

A good analogy is the difference between the original source filters module
and Damian's Filter::Simple.

Paul
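
The idea of stacking filters that each provide a Store and a Fetch routine can be sketched without any DBM file. What follows is not the DBM_Filter prototype described above: push_filter, run_store, run_fetch and the %filters table are hypothetical names invented for this illustration. Running the fetch side in reverse push order is likewise an assumption about how stacking would have to unwind, not something stated in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical filter table; each routine edits $_ in place, in the
# same style as the low-level DBM filter hooks.
my %filters = (
    utf8 => {
        Store => sub { $_ = encode_utf8($_) },
        Fetch => sub { $_ = decode_utf8($_) },
    },
    null => {
        Store => sub { $_ .= "\x00" },
        Fetch => sub { s/\x00$// },
    },
);

my @stack;    # filter names, in the order they were pushed

sub push_filter { push @stack, @_; return }

# Storing runs the filters in push order.
sub run_store {
    local $_ = $_[0];
    for my $name (@stack) { $filters{$name}{Store}->() }
    return $_;
}

# Fetching undoes them in the opposite order (an assumption, see above).
sub run_fetch {
    local $_ = $_[0];
    for my $name (reverse @stack) { $filters{$name}{Fetch}->() }
    return $_;
}

push_filter('utf8', 'null');

my $key     = "c\x{e3}o";            # "cão", 3 characters
my $on_disk = run_store($key);       # UTF-8 octets plus a trailing NUL
my $back    = run_fetch($on_disk);

printf "stored %d octets, round trip %s\n",
    length($on_disk), ($back eq $key ? "ok" : "FAILED");
```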