What is sv_utf8_upgrade() supposed to do?

Nick Ing-Simmons

unread,

Mar 7, 2004, 1:16:06 PM3/7/04

to perl5-...@perl.org

Slaven has posted an RT bug against Tk804 to the effect it does not
like substr():

https://rt.cpan.org/Ticket/Display.html?id=5529

This is pure perl version of the "bug":

#!perl
use Devel::Peek;
my $bla = ".rtsrehc\374lB";

thing(substr($bla,0));

sub thing
{
utf8::upgrade($_[0]);
print "not " unless utf8::is_utf8($_[0]);
print "ok\n";
Dump($_[0]);
}
__END__

Tk _NEEDS_ the string to be UTF-8 (because core Tk assumes all strings are)
so it uses

sv_utf8_upgrade(sv);

But in this case (as replicated above using utf8::) the upgrade
does not happen.
So what is an XS extension that wants UTF-8 supposed to use?

Sadahiro Tomoyuki

unread,

Mar 7, 2004, 6:30:29 PM3/7/04

to Nick Ing-Simmons, perl5-...@perl.org

A substr lvalue is not POK but pPOK.
I think sv_utf8_* family functions reject non-POK values.
Similar question was posted in perl-unicode about a tainted value.

http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2004-01/msg00000.html

regards
SADAHIRO Tomoyuki

Nick Ing-Simmons

unread,

Mar 9, 2004, 11:06:31 AM3/9/04

to bqw1...@nifty.com, Nick Ing-Simmons, perl5-...@perl.org

SADAHIRO Tomoyuki <bqw1...@nifty.com> writes:
>On Sun, 07 Mar 2004 18:16:06 +0000
>Nick Ing-Simmons <ni...@ing-simmons.net> wrote:
>
>>
>> Slaven has posted an RT bug against Tk804 to the effect it does not
>> like substr():
>>
>> https://rt.cpan.org/Ticket/Display.html?id=5529
>>
>> This is pure perl version of the "bug":
>>
>> #!perl
>> use Devel::Peek;
>> my $bla = ".rtsrehc\374lB";
>>
>> thing(substr($bla,0));
>>
>> sub thing
>> {
>> utf8::upgrade($_[0]);
>> print "not " unless utf8::is_utf8($_[0]);
>> print "ok\n";
>> Dump($_[0]);
>> }
>> __END__
>>
>> Tk _NEEDS_ the string to be UTF-8 (because core Tk assumes all strings are)
>> so it uses
>>
>> sv_utf8_upgrade(sv);
>> >

>A substr lvalue is not POK but pPOK.

Yes.

>I think sv_utf8_* family functions reject non-POK values.

Yes.

>Similar question was posted in perl-unicode about a tainted value.
>
>http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2004-01/msg00000.html

And what answer did they get?

Is every XS which NEEDS UTF-8 supposed to do its own hack to
get it?

Tk now has this:

if (SvPOKp(sv) && !SvPOK(sv))
{
if (SvTYPE(sv) == SVt_PVLV && !SvUTF8(sv))
{
/* LVs e.g. substr() don't upgrade */
SV *copy = newSVsv(sv);
sv_utf8_upgrade(copy);
sv_setsv(sv,copy);
SvREFCNT_dec(copy);
}
else
{
/* Slaven's fix magical (tied) SVs with only SvPOKp */
SvPOK_on(sv);
sv_utf8_upgrade(sv);
SvPOK_off(sv);
SvPOKp_on(sv);
}
}
return SvPVutf8_nolen(sv);

Which is getting messy - note that the 2nd clause (set POK) does not
work for LV - it results in an SV with SvUTF8 set but with octets
not encoded - so Tk's core segfaults.

>
>regards
>SADAHIRO Tomoyuki

Sadahiro Tomoyuki

unread,

Mar 12, 2004, 9:36:50 PM3/12/04

to Nick Ing-Simmons, perl5-...@perl.org

On Tue, 09 Mar 2004 16:06:31 +0000
Nick Ing-Simmons <nick.ing...@elixent.com> wrote:

> Is every XS which NEEDS UTF-8 supposed to do its own hack to
> get it?
>
> Tk now has this:
>
> if (SvPOKp(sv) && !SvPOK(sv))
> {
> if (SvTYPE(sv) == SVt_PVLV && !SvUTF8(sv))
> {
> /* LVs e.g. substr() don't upgrade */
> SV *copy = newSVsv(sv);
> sv_utf8_upgrade(copy);
> sv_setsv(sv,copy);
> SvREFCNT_dec(copy);
> }
> else
> {
> /* Slaven's fix magical (tied) SVs with only SvPOKp */
> SvPOK_on(sv);
> sv_utf8_upgrade(sv);
> SvPOK_off(sv);
> SvPOKp_on(sv);
> }
> }
> return SvPVutf8_nolen(sv);
>
> Which is getting messy - note that the 2nd clause (set POK) does not
> work for LV - it results in an SV with SvUTF8 set but with octets
> not encoded - so Tk's core segfaults.

I tried a code upgrading octets to utf8 more comprehensively.
People may want to upgrade not only POK strings but also
substring lvalues, tied values, overloaded objects, tainted values.
And arguments passed must not be broken.

After some trials and errors, I found the following codelet
will pass tests (shown below), but it may be not correct.
(I'm not sure how taintedness will be handled well.)

For perl 5.8.1 and later, all tests should succeed.
For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
(for overloaded objects stringified as utf8) will fail.
Very wrong for perl 5.6.0.

--- XSUB

void
unicode_upgrade(sv)
SV * sv
PROTOTYPE: $
PREINIT:
char *s;
STRLEN len;
PPCODE:
s = SvPV(sv,len); /* mg_get(sv) happens here */
if (!SvUTF8(sv)) {
SV * tmp_sv = sv_2mortal(newSVpvn(s,len));
if (SvPOKp(tmp_sv)) /* taintedness ignored */
SvPOK_on(tmp_sv);
sv_utf8_upgrade(tmp_sv);
sv = sv_mortalcopy(tmp_sv); /* taintedness recovered */
}
XPUSHs(sv);

---TEST SUITE

use XXXX qw(unicode_upgrade); # import XSUB above

use Test;
use strict;
use warnings;

BEGIN { plan tests => 57 };

package RefString;
sub new { my $p = shift; bless [@_], $p }
use overload
'""' => sub { shift->[0] };

package TieString;
require Tie::Scalar;
our @ISA = qw(Tie::StdScalar);

package main;

sub shex { unpack('H*', shift) }

sub is_tainted {
return ! eval { eval("#" . substr(join("", @_), 0, 0)); 1 };
}

my $b = pack('C*', 0xB6);
my $u = pack('U*', 0xB6);
my $w = pack('U*', 0x100);

##### STRING BYTE : 1-3

my $str1 = pack('C*', 0xB6);
ok(shex(unicode_upgrade($str1)), shex($u));
ok(shex($str1), shex($b));
ok(shex($str1) ne shex($u));

##### STRING UTF8 : 4-6

my $str2 = pack('U*', 0xB6);
ok(shex(unicode_upgrade($str2)), shex($u));
ok(shex($str2), shex($u));
ok(shex($str2) ne shex($b));

##### STRING WIDE : 7-9

my $str3 = pack('U*', 0x100);
ok(shex(unicode_upgrade($str3)), shex($w));
ok(shex($str3), shex($w));
ok(shex($str3) ne shex($b));

##### TIE SCALAR BYTE : 10-12

my $tie1; tie $tie1, 'TieString'; $tie1 = pack('C*', 0xB6);
ok(shex(unicode_upgrade($tie1)), shex($u));
ok(shex($tie1), shex($b));
ok(tied($tie1));

##### TIE SCALAR UTF8 : 13-15

my $tie2; tie $tie2, 'TieString'; $tie2 = pack('U*', 0xB6);
ok(shex(unicode_upgrade($tie2)), shex($u));
ok(shex($tie2), shex($u));
ok(tied($tie2));

##### TIE SCALAR WIDE : 16-18

my $tie3; tie $tie3, 'TieString'; $tie3 = pack('U*', 0x100);
ok(shex(unicode_upgrade($tie3)), shex($w));
ok(shex($tie3), shex($w));
ok(tied($tie3));

##### SUBSTRING LVALUE BYTE : 19-24

my $big1 = "abc".$b."def";
ok(shex(unicode_upgrade(substr($big1,3,1))), shex($u));
ok(shex(substr($big1,3,1)), shex($b));
ok(shex($big1), shex("abc".$b."def"));

my $lv1 = \substr($big1,3,1);
ok(shex(unicode_upgrade($$lv1)), shex($u));
ok(shex($$lv1), shex($b));
ok(ref $lv1, "LVALUE");

##### SUBSTRING LVALUE UTF8 : 25-30

my $big2 = "abc".$u."def";
ok(shex(unicode_upgrade(substr($big2,3,1))), shex($u));
ok(shex(substr($big2,3,1)), shex($u));
ok(shex($big2), shex("abc".$u."def"));

my $lv2 = \substr($big2,3,1);
ok(shex(unicode_upgrade($$lv2)), shex($u));
ok(shex($$lv2), shex($u));
ok(ref $lv2, "LVALUE");

##### SUBSTRING LVALUE WIDE : 31-36

my $big3 = "abc".$w."def";
ok(shex(unicode_upgrade(substr($big3,3,1))), shex($w));
ok(shex(substr($big3,3,1)), shex($w));
ok(shex($big3), shex("abc".$w."def"));

my $lv3 = \substr($big3,3,1);
ok(shex(unicode_upgrade($$lv3)), shex($w));
ok(shex($$lv3), shex($w));
ok(ref $lv3, "LVALUE");

##### OVERLOAD BYTE : 37-39

my $ov1 = new RefString pack('C*', 0xB6);
ok(shex(unicode_upgrade($ov1)), shex($u));
ok(shex($ov1), shex($b));
ok(ref $ov1, "RefString");

##### OVERLOAD UTF8 : 40-42

my $ov2 = new RefString pack('U*', 0xB6);
ok(shex(unicode_upgrade($ov2)), shex($u));
ok(shex($ov2), shex($u));
ok(ref $ov2, "RefString");

##### OVERLOAD WIDE : 43-45

my $ov3 = new RefString pack('U*', 0x100);
ok(shex(unicode_upgrade($ov3)), shex($w));
ok(shex($ov3), shex($w));
ok(ref $ov3, "RefString");

##### TAINTED BYTE : 46-49

my $env1 = $ENV{TEMP}; substr($env1,0,length $env1) = pack('C*', 0xB6);
my $old1 = is_tainted($env1);
ok(shex(unicode_upgrade($env1)), shex($u));
ok(shex($env1), shex($b));
ok(is_tainted($env1), $old1);
ok(is_tainted(unicode_upgrade($env1)), $old1);

##### TAINTED UTF8 : 50-53

my $env2 = $ENV{TEMP}; substr($env2,0,length $env2) = pack('U*', 0xB6);
my $old2 = is_tainted($env2);
ok(shex(unicode_upgrade($env2)), shex($u));
ok(shex($env2), shex($u));
ok(is_tainted($env2), $old2);
ok(is_tainted(unicode_upgrade($env2)), $old2);

##### TAINTED WIDE : 54-57

my $env3 = $ENV{TEMP}; substr($env3,0,length $env3) = pack('U*', 0x100);
my $old3 = is_tainted($env3);
ok(shex(unicode_upgrade($env3)), shex($w));
ok(shex($env3), shex($w));
ok(is_tainted($env3), $old3);
ok(is_tainted(unicode_upgrade($env3)), $old3);

__END__

regards
SADAHIRO Tomoyuki

Yitzchak Scott-Thoennes

unread,

Mar 14, 2004, 1:52:11 AM3/14/04

to SADAHIRO Tomoyuki, Nick Ing-Simmons, perl5-...@perl.org

On Sat, Mar 13, 2004 at 11:36:50AM +0900, SADAHIRO Tomoyuki <bqw1...@nifty.com> wrote:
> For perl 5.8.1 and later, all tests should succeed.
> For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
> (for overloaded objects stringified as utf8) will fail.
> Very wrong for perl 5.6.0.

IIRC, you could probably get the failing ones working with 5.8.0 by
not doing SvPV/checking UTF8 if sv is overloaded; instead do a
sv_copypv into a new mortal and then upgrade if needed.

Sadahiro Tomoyuki

unread,

Mar 14, 2004, 8:47:40 AM3/14/04

to Yitzchak Scott-Thoennes, Nick Ing-Simmons, perl5-...@perl.org

On Sat, 13 Mar 2004 22:52:11 -0800
Yitzchak Scott-Thoennes <stho...@efn.org> wrote:

> On Sat, Mar 13, 2004 at 11:36:50AM +0900, SADAHIRO Tomoyuki <bqw1...@nifty.com> wrote:
> > For perl 5.8.1 and later, all tests should succeed.
> > For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
> > (for overloaded objects stringified as utf8) will fail.
> > Very wrong for perl 5.6.0.
>
> IIRC, you could probably get the failing ones working with 5.8.0 by
> not doing SvPV/checking UTF8 if sv is overloaded; instead do a
> sv_copypv into a new mortal and then upgrade if needed.

Yes, as following it seems to work finely with 5.7.3/5.8.0.
Thank you very much!

#ifdef sv_copypv /* perl 5.7.3 or later */
SV* tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
#else
SV* tmp_sv = sv_2mortal(newSVpvn(s,len));
#endif

(And I wondered how SvUTF8(sv) returns true for an overloaded
object stringified as unicode. Devel::Peek does not show me
it is flagged with UTF8. If do_sv_dump() would print "UTF8"
in FLAGS even if an SV is neither POK nor POKp but ROK...)

> > --- XSUB
> >
> > void
> > unicode_upgrade(sv)
> > SV * sv
> > PROTOTYPE: $
> > PREINIT:
> > char *s;
> > STRLEN len;
> > PPCODE:
> > s = SvPV(sv,len); /* mg_get(sv) happens here */
> > if (!SvUTF8(sv)) {
> > SV * tmp_sv = sv_2mortal(newSVpvn(s,len));
> > if (SvPOKp(tmp_sv)) /* taintedness ignored */
> > SvPOK_on(tmp_sv);
> > sv_utf8_upgrade(tmp_sv);
> > sv = sv_mortalcopy(tmp_sv); /* taintedness recovered */
> > }
> > XPUSHs(sv);

Regards,
SADAHIRO Tomoyuki

Yitzchak Scott-Thoennes

unread,

Mar 14, 2004, 1:39:18 PM3/14/04

to SADAHIRO Tomoyuki, Nick Ing-Simmons, perl5-...@perl.org

On Sun, 14 Mar 2004 22:47:40 +0900, SADAHIRO Tomoyuki <bqw1...@nifty.com> wrote:
>
> On Sat, 13 Mar 2004 22:52:11 -0800
> Yitzchak Scott-Thoennes <stho...@efn.org> wrote:
>
> > On Sat, Mar 13, 2004 at 11:36:50AM +0900, SADAHIRO Tomoyuki <bqw1...@nifty.com> wrote:

> > > For perl 5.8.1 and later, all tests should succeed.
> > > For perl 5.8.0 and before (5.6.1-5.8.0), tests 40 and 43
> > > (for overloaded objects stringified as utf8) will fail.
> > > Very wrong for perl 5.6.0.
> >

> > IIRC, you could probably get the failing ones working with 5.8.0 by
> > not doing SvPV/checking UTF8 if sv is overloaded; instead do a
> > sv_copypv into a new mortal and then upgrade if needed.
>
> Yes, as following it seems to work finely with 5.7.3/5.8.0.
> Thank you very much!
>
> #ifdef sv_copypv /* perl 5.7.3 or later */
> SV* tmp_sv = sv_newmortal();
> sv_copypv(tmp_sv, sv);
> #else
> SV* tmp_sv = sv_2mortal(newSVpvn(s,len));
> #endif
>
> (And I wondered how SvUTF8(sv) returns true for an overloaded
> object stringified as unicode. Devel::Peek does not show me
> it is flagged with UTF8. If do_sv_dump() would print "UTF8"
> in FLAGS even if an SV is neither POK nor POKp but ROK...)

The UTF8 flag is set on an overloaded RV only *after* sv_2pv_flags
is called. And do_sv_dump doesn't ever do that, nor I think should it.

Rafael Garcia-Suarez

unread,

Mar 14, 2004, 5:12:04 PM3/14/04

to SADAHIRO Tomoyuki, Yitzchak Scott-Thoennes, Nick Ing-Simmons, perl5-...@perl.org

SADAHIRO Tomoyuki wrote:
> > > void
> > > unicode_upgrade(sv)

Now perhaps this function should be put somewhere in the core for
the XS authors to peruse ?

Sadahiro Tomoyuki

unread,

Mar 14, 2004, 6:25:02 PM3/14/04

to Yitzchak Scott-Thoennes, Nick Ing-Simmons, perl5-...@perl.org

Yes, that is why I should say SvPV before SvUTF8.

And what I mentioned about do_sv_dump() is about following lines:
--dump.c
if ((SvPOK(sv) || SvPOKp(sv)) && SvUTF8(sv))
sv_catpv(d, "UTF8");

If the conditional is C<if (SvUTF8(sv))> simply, in some cases
Devel::Peek would output as "FLAGS = (ROK,OVERLOAD,UTF8)".

Thank you,
SADAHIRO Tomoyuki

Yitzchak Scott-Thoennes

unread,

Mar 14, 2004, 6:37:01 PM3/14/04

to SADAHIRO Tomoyuki, Nick Ing-Simmons, perl5-...@perl.org

That would just tell you that there was a previous stringification and
it yielded UTF8; the next stringification may not. Since Dump doesn't
actually do or report on the results of the overload call, my thought
was that it makes no sense to report the UTF8 flag.

Once whatever op did the sv_2pv_flags() call is over, the UTF8 flag is
"stale" and no longer relevant.