I am attaching a perl script in UTF8 which does not work, and I think it
should. Please, if it is my fault, explain how to solve it.
Best regards,
Alberto
Alberto,
in what way does it not work? What did you expect it to do and what did it
actually do?
cheers
Paul
Basically, I've set a key and data in utf8 in the db file and, when I
retrieve them perl thinks they are normal strings (not in utf8).
So, if my key is 'cão' (dog in Portuguese) DB will return cão as 4
different characters...
I've tried to use Encode to encode the strings to iso-8859-1 before
printing them (I thought it could be a print/terminal problem) but it
prints exactly the same.
Cheers,
Alb
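The symptom described above (three characters coming back as four) can be reproduced without any database. The following is only a minimal sketch, assuming the stored key is the UTF-8 encoding of 'cão'; the variable names are illustrative and not taken from the attached script.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# The UTF-8 encoding of "cão", written as explicit bytes so the example
# does not depend on the encoding of this source file.
my $bytes = "c\xC3\xA3o";

# Undecoded, perl counts octets: c, 0xC3, 0xA3, o.
my $byte_len = length $bytes;

# Decoded, perl counts characters: c, ã, o.
my $chars    = decode_utf8($bytes);
my $char_len = length $chars;

printf "as bytes: %d, as characters: %d\n", $byte_len, $char_len;

# Encoding the characters again gives back the original octets.
my $re_encoded_matches = encode_utf8($chars) eq $bytes ? 1 : 0;
```

A string fetched from the DBM file without decoding behaves like `$bytes` here, which would explain why it looks like four separate characters.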
If I removed the "use utf8" from your script, it seems to work.
Here is a slight revamp of your script that I used.
Paul
#!/usr/bin/perl
#use utf8;
#use locale;
use DB_File;
use Fcntl;
use strict;
use warnings;
my %db;
tie %db, 'DB_File', 'foo.db', O_RDWR|O_CREAT, 0666, $DB_BTREE;
while (my ($k, $v) = each %db) {
    print "$k -> $v\n";
    print "key ok\n" if $k eq "cão";
    print "val ok\n" if $v eq "Animal que não é o gato";
}
$db{"cão"} = "Animal que não é o gato";
print "ok\n" if $db{"cão"} eq "Animal que não é o gato";
cheers
Paul
But, removing the utf8 pragma, perl will interpret those strings as
sequences of plain characters, not unicode ones.
From the utf8 manpage:
Until UTF-8 becomes the default format for source text,
either this pragma or the "encoding" pragma should be used
to recognize UTF-8 in the source. When UTF-8 becomes the
standard source format, this pragma will effectively
become a no-op.
So, I think the problem is still there. You didn't solve the problem...
you simply removed the use of real utf8 (well, at least this is how I
see the problem).
Time to call in the expert. Dan, can you comment please?
Last time I remember something like this coming up, the line was that XS modules that read/write Perl strings to a file didn't have to do anything different. Is that true?
Paul
filter_fetch_key(sub {$_=decode_utf8($_)});
filter_store_key(sub {$_=encode_utf8($_)});
and also the same for value. Or only for fetching from database if
storing is ok.
Roman Vasicek
Alberto Manuel Brandão simões wrote:
> On Thu, 2003-12-04 at 13:45, Paul Marquess wrote:
>
>>>Hi.
>>>I was just playing with Gtk2 binding for perl with a direct tie to a DB
>>>(well, was MLDBM) file and found some problems.
>>>
>>>I am attaching a perl script in UTF8 which does not work, and I think it
>>>should. Please, if it my fault explain how to solve it.
>>
>>Alberto,
>>
>>in what way does it not work? What did you expect it to do and what did it
>>actually do?
>
>
> Basically, I've set a key and data in utf8 in the db file and, when I
> retrieve them perl thinks they are normal strings (not in utf8).
> So, if my key is 'cão' (dog in Portuguese) DB will return cão as 4
No.
An SV can be in one of two states, as shown by SvUTF8(sv).
If that is true then SvPV is in UTF-8; if it is false then SvPV is
octets, assumed to be in iso-8859-1 in some legacy sense.
So if XS code just reads/writes SvPV it should do one of:
A. if (SvUTF8(sv)) downgrade to bytes and croak if it can't
   i.e. external file is octets, user should "encode" data
   before write and "decode" after read.
   e.g. call SvPVbyte()
B. if (!SvUTF8(sv)) upgrade to UTF-8 (this should not fail).
   i.e. file is in "characters" held as UTF-8.
   e.g. call SvPVutf8()
C. Store the state of flag with data and re-set on read.
My guess is DB_File should do (A) for data part, less clear what
to do about the key part.
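Options (A) and (B) are C-level calls, but their effect can be approximated from Perl space. The following is a rough sketch only: it assumes that utf8::downgrade() and utf8::upgrade() are close enough stand-ins for what SvPVbyte() and SvPVutf8() do to a string, and the helper names to_bytes_or_die and to_utf8_octets are made up for illustration.

```perl
use strict;
use warnings;

# Option (A), approximated: hand back octets, or die if a character
# does not fit in a single byte.
sub to_bytes_or_die {
    my ($s) = @_;            # work on a copy of the caller's string
    utf8::downgrade($s);     # dies on characters above 0xFF
    return $s;
}

# Option (B), approximated: upgrade (which cannot fail), then expose
# the internal UTF-8 buffer as plain octets.
sub to_utf8_octets {
    my ($s) = @_;
    utf8::upgrade($s);
    utf8::encode($s);
    return $s;
}

my $latin = "c\x{E3}o";      # three characters, all below 0x100
my $wide  = "\x{263A}";      # one character above 0xFF

my $a_latin = to_bytes_or_die($latin);          # 3 octets, iso-8859-1
my $a_wide  = eval { to_bytes_or_die($wide) };  # undef, $@ is set
my $a_error = $@;

my $b_latin = to_utf8_octets($latin);           # 4 octets, UTF-8
my $b_wide  = to_utf8_octets($wide);            # 3 octets, UTF-8
```

Under (A) the file holds whatever octets the application chose, and anything that cannot be a single byte is rejected at write time. Under (B) the file always holds UTF-8.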
:-)
This is coming back to me now. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-04/msg00787.html for a discussion on this very topic. It happened just prior to 5.8.0 being released.
The gist of that thread is that encoding/decoding to/from UTF-8 is an application issue and that DBM Filters were the way to do it.
I assume that the goalposts haven't moved, and that it is still true to say that this is an application issue.
This reminds me - is an enhanced DBM filters module that can handle utf-8 encoding/decoding still wanted in the core? The thread reference above seemed to imply that it was. I've had a prototype up and running for a while. I just need a push to tidy it up.
Paul
Or you can add Documentation in the POD to say that this problem exists
and how to solve it (Yes, it was that way I solved my problem, so far.)
Best regards,
Alberto
You have utf-8 in both the key and the value, so you need to do this
use DB_File;
use Encode;
my $h = tie %hash, ...
$h->filter_fetch_key ( sub { $_ = Encode::decode_utf8($_) } );
$h->filter_fetch_value( sub { $_ = Encode::decode_utf8($_) } );
$h->filter_store_key ( sub { $_ = Encode::encode_utf8($_) } );
$h->filter_store_value( sub { $_ = Encode::encode_utf8($_) } );
Paul
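For reference, here is a self-contained version of the four filters above that can be run as is. It is a sketch under a few assumptions that are not in the thread: an in-memory database (undef as the filename) is used so that no file is left behind, $DB_BTREE is kept from the earlier script, and the test strings are written with escapes so the result does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;
use Encode qw(decode_utf8 encode_utf8);

my %hash;
# undef as the filename asks DB_File for an in-memory database.
my $h = tie %hash, 'DB_File', undef, O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie: $!";

# Decode on the way out of the database, encode on the way in.
$h->filter_fetch_key  ( sub { $_ = decode_utf8($_) } );
$h->filter_fetch_value( sub { $_ = decode_utf8($_) } );
$h->filter_store_key  ( sub { $_ = encode_utf8($_) } );
$h->filter_store_value( sub { $_ = encode_utf8($_) } );

my $key   = "c\x{E3}o";                           # "cão"
my $value = "Animal que n\x{E3}o \x{E9} o gato";  # "Animal que não é o gato"

$hash{$key} = $value;

# Read the single record back through the fetch filters.
my ($k, $v) = each %hash;
my $key_len       = length $k;
my $round_trip_ok = ($k eq $key && $v eq $value) ? 1 : 0;

print "round trip ok\n" if $round_trip_ok;

undef $h;
untie %hash;
```

With the filters installed, the key comes back as three characters rather than four octets.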
You could check for valid utf8 but then there's the possibility of getting
false positives. I don't see a way to do the conversion automatically
without breaking older db files.
> Or you can add Documentation in the POD to say that this problem exists
> and how to solve it (Yes, it was that way I solved my problem, so far.)
That's the way, maybe along with some predefined filters (either in
Pod or in .pm) to make the conversion easy.
Regards,
Slaven
--
__o Slaven Rezic
_`\<,_ slaven <at> rezic <dot> de
__(_)/ (_)____
______________________________________________________________________________
> OK, that can be a solution, if DB really does not work with native utf8.
> Now, I think the DB_File module can check if the string is utf8 and try
> to latin'ize it... if it fails, croak.
Which is exactly what SvPVbyte() does for you. DB_File.xs should
basically replace its calls to SvPV() with SvPVbyte(), as Nick
suggested a few messages ago.
--Gisle
If that is the preferred method, then why do you need the ":utf8" layer when
writing utf8 to standard files? [see perluniintro from 5.8.2]
This strikes me as being inconsistent.
Paul
Which is my (A).
>
>I assume that the goalposts haven't moved, and is still true to say that this
>is an application issue.
It is, but to be "bomb proof" you should still check that they
have done so - i.e. use SvPVbyte
You don't need the :utf8 layer if you are willing to encode() yourself.
You can put a layer-like mechanism on DB_File access as well if you
wish to save the user from explicit encode() or decode() calls. Using
SvPVbyte() just ensures that you don't drop information silently when
marshalling strings to storage that only supports bytes. It does not
prevent adding layers later.
Regards,
Gisle
If it has :utf8 then they are not 'standard' files ;-)
However, :utf8 and :bytes are in line with all this:
:bytes (the default) causes IO to do equivalent of SvPVbyte - downgrade
or whinge. (For historical reasons it is only a warn not a die.)
:utf8 tells IO to do equivalent of SvPVutf8
Doing this as a "mode" of the file handle makes sense for normal IO.
Note that non-tied hashes handle this by themselves (eventually with
some pain and false starts).
Feel free to make it a "mode" of DB_File too.
>
>This strikes me as being inconsistent.
What, perl inconsistent, - how unusual ;-)
> :bytes (the default) causes IO to do equivalent of SvPVbyte - downgrade
> or whinge. (For historical reasons it is only a warn not a die.)
The historical reasons only apply to print, not to syswrite. So we
have print warn, but syswrite die.
> What, perl inconsistent, - how unusual ;-)
Regards,
Gisle
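The difference between print and syswrite on a handle without a :utf8 layer can be observed with a short script. This is a sketch only: it assumes a perl where print emits a "Wide character" warning and syswrite throws, as described in the message above, and it writes to a temporary file created with File::Temp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A plain handle: no :utf8 layer has been pushed on it.
my ($fh, $path) = tempfile(UNLINK => 1);

my $wide = "\x{263A}";    # a character above 0xFF

# print: collect whatever warning is emitted instead of letting it escape.
my $warning = '';
{
    local $SIG{__WARN__} = sub { $warning .= $_[0] };
    print {$fh} $wide;
}

# syswrite: trap the exception, if any.
my $syswrite_ok = eval { syswrite($fh, $wide); 1 } ? 1 : 0;
my $error       = $syswrite_ok ? '' : $@;

close $fh;

print "print warned:  $warning" if $warning;
print "syswrite died: $error"   if $error;
```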
Well I do have both a "utf8" and a general "encoding" filter written for my new DBM Filters module.
> >
> >This strikes me as being inconsistent.
>
> What, perl inconsistent, - how unusual ;-)
As long as it isn't inconsistency for its own sake :-)
I'm keen to get this sorted out once and for all, both for DB_File and for all the other *DB*_File modules. At the moment I'm leaning towards providing canned DBM filters to handle the utf8 problem, but I need to find some time to see what interaction SvPVbyte() has, if any, with the existing DBM filters hooks.
I guess the things I need to convince myself of before I decide one way or the other are:
1. Adding SvPVbyte() won't break existing code.
Is anyone relying on the existing behaviour?
2. Does adding SvPVbyte() (without the aid of a DBM filter or someone explicitly calling encode on the returned data) mean that round-trip integrity of the data is maintained, i.e. if I write some utf-8 data to a DBM file, then read it back, will it be identical to the original data?
3. Can it co-exist with the existing DBM Filters hooks?
I think the trick here would be to make sure that the SvPVbyte() call is done after any filters have been invoked. I'll need to dig into the code to see if this is an issue.
4. DB_File has a parallel existence on CPAN.
Whatever is done, DB_File still needs to be able to be built with legacy versions of Perl.
I'll have a play with SvPVbyte() and report back.
Paul
> This reminds me - is an enhanced DBM filters module that can handle utf-8 encoding/decoding still wanted in the core? The thread reference above seemed to imply that it was. I've had a prototype up and running for a while. I just need a push to tidy it up.
I'm only going to integrate what I find in blead :-)
I can see this specific filter as being useful to applications.
Would it also serve as a useful example of how to write a filter?
How large (small?) is it?
Nicholas Clark
Well we can try.
>At the moment I'm leaning towards providing
>canned DBM filters to handle the utf8 problem, but I need to find some time
>to see what interaction SvPVbyte() has, if any, with the existing DBM
>filters hooks.
>
>I guess the things I need to convince myself of before I decide one way or the
>other are:
>
>1. Adding SvPVbyte() won't break existing code. Is anyone relying on the
> existing behaviour?
I don't think it can _break_ existing code. They can't really be relying
on the "use whatever SvPV happens to be" behaviour because it isn't reliable.
Their code may be working by a fluke, then one day someone puts a £ or ¢
some where new and phut.
>
>2. Does adding SvPVbyte() (without the aid of a DBM filter or someone
> explicitly calling encode on the returned data) mean that round-trip
> integrity of the data is maintained, i.e. if I write some utf-8 data to a
> DBM file, then read it back, will it be identical to the original data?
Yes, a round trip is maintained. If they try to write chars > U+00FF to
the thing without encoding them the _write_ will croak with SvPVbyte -
so the trip gets canceled on the outbound leg.
>
>3. Can it co-exist with the existing DBM Filters hooks? I think the trick here
> would be to make sure that the SvPVbyte() call is done after any filters
> have been invoked. I'll need to dig into the code to see if this is an
> issue.
I think the SvPVbyte check should be as close to the "file" and as far from
user perl code as possible (in the "pipe"). That way a hook of some kind
can be used to do the encode.
Note that on the read side the SvPVbyte-oid is to make sure that the
SV you return to perl isn't accidentally marked as UTF8.
This is just SvUTF8_off(sv) unless you have to handle appending your
data to an existing UTF8 SV.
>
>4. DB_File has a parallel existence on CPAN. Whatever is done, DB_File still
> needs to be able to be built with legacy versions of Perl.
SvPVbyte dates back (with slightly evolving semantics) to 5.6 which
introduced UTF-8-ness
What I've done is to write a pure-perl module, called DBM_Filter, that makes
use of the existing low-level DBM filter hooks present in each of the
*DB*_File modules. It is about 200 lines worth of code.
It started off life prior to 5.8.0 when there was a discussion about how
best to provide an easy way for an Encoding filter to be applied to a DBM
file. The existing DBM filters were proposed, and although they can do the
job, I thought they are a bit too low level.
That's when I thought of the DBM_Filter module. The main thing it does is
provide a framework to make it easier to write DBM filters and to put common
filters, like encoding and null termination, into modules of their own.
Taking null termination as an example (this was the original reason for
writing the DBM filters in the first place), here is what a user has to do
at present
use DB_File;
my $db = tie %hash, ...
$db->filter_fetch_key  ( sub { s/\x00$// } );
$db->filter_fetch_value( sub { s/\x00$// } );
$db->filter_store_key  ( sub { $_ .= "\x00" } );
$db->filter_store_value( sub { $_ .= "\x00" } );
With the DBM_Filter module, they do this
use DBM_Filter;
use DB_File ;
my $db = tie %hash, ...
$db->Filter_Push('null');
and here is the complete null filter module:
package DBM_Filter::null ;
sub Store { $_ .= "\x00" ; }
sub Fetch { s/\x00$// ; }
1;
All the hard work is done in the DBM_Filter module. It allows the filters to
be applied to the keys only, the values only, or to both keys and values. It
also allows stacking of filters.
I also have a general-purpose encoding filter, which isn't much bigger than
the null filter above. Its usage is
$db->Filter_Push('encoding' => 'iso-8859-16');
A good analogy is the difference between the original source filters module
and Damian's Filter::Simple.
Paul
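By analogy with the null filter shown above, a utf8 filter module might look roughly like the following. This is a guess at the shape, not the prototype mentioned in the thread: the package name DBM_Filter::utf8 is assumed, and the two subs are exercised directly because the DBM_Filter module itself is not shown here.

```perl
use strict;
use warnings;

package DBM_Filter::utf8;    # assumed name, modelled on DBM_Filter::null

use Encode ();

# Like the null filter, both subs edit $_ in place.
sub Store { $_ = Encode::encode_utf8($_) }    # characters -> UTF-8 octets
sub Fetch { $_ = Encode::decode_utf8($_) }    # UTF-8 octets -> characters

package main;

# Exercise the two halves directly, without DBM_Filter itself.
my $stored  = do { local $_ = "c\x{E3}o"; DBM_Filter::utf8::Store(); $_ };
my $fetched = do { local $_ = $stored;    DBM_Filter::utf8::Fetch(); $_ };

printf "stored %d octets, fetched %d characters\n",
    length $stored, length $fetched;
```

If the framework works as described for 'null', such a module would presumably be enabled with $db->Filter_Push('utf8'), but that call is an assumption here.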