https://rt.cpan.org/Ticket/Display.html?id=5529
This is pure perl version of the "bug":
#!perl
use Devel::Peek;
my $bla = ".rtsrehc\374lB";
thing(substr($bla,0));
sub thing
{
utf8::upgrade($_[0]);
print "not " unless utf8::is_utf8($_[0]);
print "ok\n";
Dump($_[0]);
}
__END__
Tk _NEEDS_ the string to be UTF-8 (because core Tk assumes all strings are)
so it uses
sv_utf8_upgrade(sv);
But in this case (as replicated above using utf8::) the upgrade
does not happen.
So what is an XS extension that wants UTF-8 supposed to use?
A substr lvalue is not POK but pPOK.
I think sv_utf8_* family functions reject non-POK values.
Similar question was posted in perl-unicode about a tainted value.
http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2004-01/msg00000.html
regards
SADAHIRO Tomoyuki
Yes.
>I think sv_utf8_* family functions reject non-POK values.
Yes.
>Similar question was posted in perl-unicode about a tainted value.
>
>http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2004-01/msg00000.html
And what answer did they get?
Is every XS which NEEDS UTF-8 supposed to do its own hack to
get it?
Tk now has this:
if (SvPOKp(sv) && !SvPOK(sv))
{
if (SvTYPE(sv) == SVt_PVLV && !SvUTF8(sv))
{
/* LVs e.g. substr() don't upgrade */
SV *copy = newSVsv(sv);
sv_utf8_upgrade(copy);
sv_setsv(sv,copy);
SvREFCNT_dec(copy);
}
else
{
/* Slaven's fix magical (tied) SVs with only SvPOKp */
SvPOK_on(sv);
sv_utf8_upgrade(sv);
SvPOK_off(sv);
SvPOKp_on(sv);
}
}
return SvPVutf8_nolen(sv);
Which is getting messy - note that the 2nd clause (set POK) does not
work for LV - it results in an SV with SvUTF8 set but with octets
not encoded - so Tk's core segfaults.
>
>regards
>SADAHIRO Tomoyuki
> Is every XS which NEEDS UTF-8 supposed to do its own hack to
> get it?
>
> Tk now has this:
>
> if (SvPOKp(sv) && !SvPOK(sv))
> {
> if (SvTYPE(sv) == SVt_PVLV && !SvUTF8(sv))
> {
> /* LVs e.g. substr() don't upgrade */
> SV *copy = newSVsv(sv);
> sv_utf8_upgrade(copy);
> sv_setsv(sv,copy);
> SvREFCNT_dec(copy);
> }
> else
> {
> /* Slaven's fix magical (tied) SVs with only SvPOKp */
> SvPOK_on(sv);
> sv_utf8_upgrade(sv);
> SvPOK_off(sv);
> SvPOKp_on(sv);
> }
> }
> return SvPVutf8_nolen(sv);
>
> Which is getting messy - note that the 2nd clause (set POK) does not
> work for LV - it results in an SV with SvUTF8 set but with octets
> not encoded - so Tk's core segfaults.
I tried a code upgrading octets to utf8 more comprehensively.
People may want to upgrade not only POK strings but also
substring lvalues, tied values, overloaded objects, tainted values.
And arguments passed must not be broken.
After some trials and errors, I found the following codelet
will pass tests (shown below), but it may be not correct.
(I'm not sure how taintedness will be handled well.)
For perl 5.8.1 and later, all tests should succeed.
For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
(for overloaded objects stringified as utf8) will fail.
Very wrong for perl 5.6.0.
--- XSUB
void
unicode_upgrade(sv)
SV * sv
PROTOTYPE: $
PREINIT:
char *s;
STRLEN len;
PPCODE:
s = SvPV(sv,len); /* mg_get(sv) happens here */
if (!SvUTF8(sv)) {
SV * tmp_sv = sv_2mortal(newSVpvn(s,len));
if (SvPOKp(tmp_sv)) /* taintedness ignored */
SvPOK_on(tmp_sv);
sv_utf8_upgrade(tmp_sv);
sv = sv_mortalcopy(tmp_sv); /* taintedness recovered */
}
XPUSHs(sv);
---TEST SUITE
use XXXX qw(unicode_upgrade); # import XSUB above
use Test;
use strict;
use warnings;
BEGIN { plan tests => 57 };
package RefString;
sub new { my $p = shift; bless [@_], $p }
use overload
'""' => sub { shift->[0] };
package TieString;
require Tie::Scalar;
our @ISA = qw(Tie::StdScalar);
package main;
sub shex { unpack('H*', shift) }
sub is_tainted {
return ! eval { eval("#" . substr(join("", @_), 0, 0)); 1 };
}
my $b = pack('C*', 0xB6);
my $u = pack('U*', 0xB6);
my $w = pack('U*', 0x100);
##### STRING BYTE : 1-3
my $str1 = pack('C*', 0xB6);
ok(shex(unicode_upgrade($str1)), shex($u));
ok(shex($str1), shex($b));
ok(shex($str1) ne shex($u));
##### STRING UTF8 : 4-6
my $str2 = pack('U*', 0xB6);
ok(shex(unicode_upgrade($str2)), shex($u));
ok(shex($str2), shex($u));
ok(shex($str2) ne shex($b));
##### STRING WIDE : 7-9
my $str3 = pack('U*', 0x100);
ok(shex(unicode_upgrade($str3)), shex($w));
ok(shex($str3), shex($w));
ok(shex($str3) ne shex($b));
##### TIE SCALAR BYTE : 10-12
my $tie1; tie $tie1, 'TieString'; $tie1 = pack('C*', 0xB6);
ok(shex(unicode_upgrade($tie1)), shex($u));
ok(shex($tie1), shex($b));
ok(tied($tie1));
##### TIE SCALAR UTF8 : 13-15
my $tie2; tie $tie2, 'TieString'; $tie2 = pack('U*', 0xB6);
ok(shex(unicode_upgrade($tie2)), shex($u));
ok(shex($tie2), shex($u));
ok(tied($tie2));
##### TIE SCALAR WIDE : 16-18
my $tie3; tie $tie3, 'TieString'; $tie3 = pack('U*', 0x100);
ok(shex(unicode_upgrade($tie3)), shex($w));
ok(shex($tie3), shex($w));
ok(tied($tie3));
##### SUBSTRING LVALUE BYTE : 19-24
my $big1 = "abc".$b."def";
ok(shex(unicode_upgrade(substr($big1,3,1))), shex($u));
ok(shex(substr($big1,3,1)), shex($b));
ok(shex($big1), shex("abc".$b."def"));
my $lv1 = \substr($big1,3,1);
ok(shex(unicode_upgrade($$lv1)), shex($u));
ok(shex($$lv1), shex($b));
ok(ref $lv1, "LVALUE");
##### SUBSTRING LVALUE UTF8 : 25-30
my $big2 = "abc".$u."def";
ok(shex(unicode_upgrade(substr($big2,3,1))), shex($u));
ok(shex(substr($big2,3,1)), shex($u));
ok(shex($big2), shex("abc".$u."def"));
my $lv2 = \substr($big2,3,1);
ok(shex(unicode_upgrade($$lv2)), shex($u));
ok(shex($$lv2), shex($u));
ok(ref $lv2, "LVALUE");
##### SUBSTRING LVALUE WIDE : 31-36
my $big3 = "abc".$w."def";
ok(shex(unicode_upgrade(substr($big3,3,1))), shex($w));
ok(shex(substr($big3,3,1)), shex($w));
ok(shex($big3), shex("abc".$w."def"));
my $lv3 = \substr($big3,3,1);
ok(shex(unicode_upgrade($$lv3)), shex($w));
ok(shex($$lv3), shex($w));
ok(ref $lv3, "LVALUE");
##### OVERLOAD BYTE : 37-39
my $ov1 = new RefString pack('C*', 0xB6);
ok(shex(unicode_upgrade($ov1)), shex($u));
ok(shex($ov1), shex($b));
ok(ref $ov1, "RefString");
##### OVERLOAD UTF8 : 40-42
my $ov2 = new RefString pack('U*', 0xB6);
ok(shex(unicode_upgrade($ov2)), shex($u));
ok(shex($ov2), shex($u));
ok(ref $ov2, "RefString");
##### OVERLOAD WIDE : 43-45
my $ov3 = new RefString pack('U*', 0x100);
ok(shex(unicode_upgrade($ov3)), shex($w));
ok(shex($ov3), shex($w));
ok(ref $ov3, "RefString");
##### TAINTED BYTE : 46-49
my $env1 = $ENV{TEMP}; substr($env1,0,length $env1) = pack('C*', 0xB6);
my $old1 = is_tainted($env1);
ok(shex(unicode_upgrade($env1)), shex($u));
ok(shex($env1), shex($b));
ok(is_tainted($env1), $old1);
ok(is_tainted(unicode_upgrade($env1)), $old1);
##### TAINTED UTF8 : 50-53
my $env2 = $ENV{TEMP}; substr($env2,0,length $env2) = pack('U*', 0xB6);
my $old2 = is_tainted($env2);
ok(shex(unicode_upgrade($env2)), shex($u));
ok(shex($env2), shex($u));
ok(is_tainted($env2), $old2);
ok(is_tainted(unicode_upgrade($env2)), $old2);
##### TAINTED WIDE : 54-57
my $env3 = $ENV{TEMP}; substr($env3,0,length $env3) = pack('U*', 0x100);
my $old3 = is_tainted($env3);
ok(shex(unicode_upgrade($env3)), shex($w));
ok(shex($env3), shex($w));
ok(is_tainted($env3), $old3);
ok(is_tainted(unicode_upgrade($env3)), $old3);
__END__
regards
SADAHIRO Tomoyuki
IIRC, you could probably get the failing ones working with 5.8.0 by
not doing SvPV/checking UTF8 if sv is overloaded; instead do a
sv_copypv into a new mortal and then upgrade if needed.
> On Sat, Mar 13, 2004 at 11:36:50AM +0900, SADAHIRO Tomoyuki <bqw1...@nifty.com> wrote:
> > For perl 5.8.1 and later, all tests should succeed.
> > For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
> > (for overloaded objects stringified as utf8) will fail.
> > Very wrong for perl 5.6.0.
>
> IIRC, you could probably get the failing ones working with 5.8.0 by
> not doing SvPV/checking UTF8 if sv is overloaded; instead do a
> sv_copypv into a new mortal and then upgrade if needed.
Yes, as following it seems to work finely with 5.7.3/5.8.0.
Thank you very much!
#ifdef sv_copypv /* perl 5.7.3 or later */
SV* tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
#else
SV* tmp_sv = sv_2mortal(newSVpvn(s,len));
#endif
(And I wondered how SvUTF8(sv) returns true for an overloaded
object stringified as unicode. Devel::Peek does not show me
it is flagged with UTF8. If do_sv_dump() would print "UTF8"
in FLAGS even if an SV is neither POK nor POKp but ROK...)
> > --- XSUB
> >
> > void
> > unicode_upgrade(sv)
> > SV * sv
> > PROTOTYPE: $
> > PREINIT:
> > char *s;
> > STRLEN len;
> > PPCODE:
> > s = SvPV(sv,len); /* mg_get(sv) happens here */
> > if (!SvUTF8(sv)) {
> > SV * tmp_sv = sv_2mortal(newSVpvn(s,len));
> > if (SvPOKp(tmp_sv)) /* taintedness ignored */
> > SvPOK_on(tmp_sv);
> > sv_utf8_upgrade(tmp_sv);
> > sv = sv_mortalcopy(tmp_sv); /* taintedness recovered */
> > }
> > XPUSHs(sv);
Regards,
SADAHIRO Tomoyuki
The UTF8 flag is set on an overloaded RV only *after* sv_2pv_flags
is called. And do_sv_dump doesn't ever do that, nor I think should it.
Now perhaps this function should be put somewhere in the core for
the XS authors to peruse ?
Yes, that is why I should say SvPV before SvUTF8.
And what I mentioned about do_sv_dump() is about following lines:
--dump.c
if ((SvPOK(sv) || SvPOKp(sv)) && SvUTF8(sv))
sv_catpv(d, "UTF8");
If the conditional is C<if (SvUTF8(sv))> simply, in some cases
Devel::Peek would output as "FLAGS = (ROK,OVERLOAD,UTF8)".
Thank you,
SADAHIRO Tomoyuki
That would just tell you that there was a previous stringification and
it yielded UTF8; the next stringification may not. Since Dump doesn't
actually do or report on the results of the overload call, my thought
was that it makes no sense to report the UTF8 flag.
Once whatever op did the sv_2pv_flags() call is over, the UTF8 flag is
"stale" and no longer relevant.