
[perl #118357] Stringification doesn't preserve utf8 flag


Jiří Pavlovský

Jun 6, 2013, 5:02:10 AM
to bugs-bi...@rt.perl.org
# New Ticket Created by Jiří Pavlovský
# Please include the string: [perl #118357]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt.perl.org:443/rt3/Ticket/Display.html?id=118357 >



This is a bug report for perl from ji...@pavlovsky.eu,
generated with the help of perlbug 1.39 running under perl 5.16.3.


-----------------------------------------------------------------
[Please describe your issue here]

Stringification doesn't preserve utf8 flag

The following code prints:

n is utf8 at ./test_stringify_utf8.pl line 46.
$t->{name} is utf8 at ./test_stringify_utf8.pl line 47.
t is not utf8 at ./test_stringify_utf8.pl line 48.


#!/usr/bin/env perl

use utf8;
use Encode qw/is_utf8/;
use strict;

use Modern::Perl '2013';

package Test;
use strict;

sub new {
    my ($class, $name) = @_;

    my $self = { name => $name };
    bless $self, $class;

    return $self;
}

BEGIN {
    my %OVERLOADS = (fallback => 1);

    $OVERLOADS{'""'} = 'to_string';

    use overload;
    overload->import(%OVERLOADS);
}

sub to_string { shift->{name} }


package main;

my $n = "Derviş";
my $t = Test->new($n);

binmode STDOUT, ":utf8";

is_utf8($n) ? warn "n is utf8" : warn "n is not utf8";
is_utf8($t->{name}) ? warn '$t->{name} is utf8' : warn '$t->{name} is not utf8';
is_utf8($t) ? warn "t is utf8" : warn "t is not utf8";



[Please do not change anything below this line]
-----------------------------------------------------------------
---
Flags:
category=library
severity=medium
module=overload
---
Site configuration information for perl 5.16.3:

Configured by gecko at Wed Mar 13 11:25:21 2013.

Summary of my perl5 (revision 5 version 16 subversion 3) configuration:

Platform:
osname=MSWin32, osvers=5.2, archname=MSWin32-x86-multi-thread
uname=''
config_args='undef'
hint=recommended, useposix=true, d_sigaction=undef
useithreads=define, usemultiplicity=define
useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
use64bitint=undef, use64bitall=undef, uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='C:\Perl\site\bin\gcc.exe', ccflags ='-DNDEBUG -DWIN32 -D_CONSOLE
-DNO_STRICT -DPERL_TEXTMODE_SCRIPTS -DUSE_SITECUSTOMIZE
-DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO
-D_USE_32BIT_TIME_T -DHASATTRIBUTE -fno-strict-aliasing -mms-bitfields',
optimize='-O2',
cppflags='-DWIN32'
ccversion='', gccversion='3.4.5 (mingw-vista special r3)',
gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=8
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='C:\Perl\site\bin\g++.exe', ldflags ='-L"C:\Perl\lib\CORE"'
libpth=\lib
libs=-lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32 -ladvapi32
-lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr -lwinmm
-lversion -lodbc32 -lodbccp32 -lcomctl32 -lmsvcrt
perllibs=-lkernel32 -luser32 -lgdi32 -lwinspool -lcomdlg32
-ladvapi32 -lshell32 -lole32 -loleaut32 -lnetapi32 -luuid -lws2_32 -lmpr
-lwinmm -lversion -lodbc32 -lodbccp32 -lcomctl32 -lmsvcrt
libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl516.lib
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' '
cccdlflags=' ', lddlflags='-mdll -L"C:\Perl\lib\CORE"'

Locally applied patches:
ACTIVEPERL_LOCAL_PATCHES_ENTRY

---
@INC for perl 5.16.3:
C:/Perl/site/lib
C:/Perl/lib
.

---
Environment for perl 5.16.3:
HOME=Y:\
LANG (unset)
LANGUAGE (unset)
LD_LIBRARY_PATH (unset)
LOGDIR (unset)
PATH=C:\Program Files\ActiveState Perl Dev Kit
9.2.1\bin\;C:\Perl\site\bin;C:\Perl\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program
Files\TortoiseSVN\bin;c:\perl\bin;c:\mingw\bin;c:\mingw\msys\1.0\bin;c:\Program
FIles\java\jdk1.7.0_09\bin;C:\Program
Files\GNU\GnuPG\pub;c:\msys\bin;c:\MinGW\bin;c:\msys\local\bin
PERL_BADLANG (unset)
SHELL (unset)

David Golden

Jun 6, 2013, 2:57:14 PM
to p5p
On Thu, Jun 6, 2013 at 5:02 AM, Jiří Pavlovský
<perlbug-...@perl.org> wrote:
> is_utf8($t) ? warn "t is utf8" : warn "t is not utf8";

The issue isn't that stringification doesn't preserve it, but rather
that utf8::is_utf8 doesn't coerce its argument to a string form.

Try this:
is_utf8("$t") ? warn "t is utf8" : warn "t is not utf8";

It will say "is utf8".

So either utf8::is_utf8 is incorrectly documented as taking a string,
or else it has a bug and should stringify its argument.
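
The difference between the two calls can be reproduced in a few lines.
This is only a minimal sketch: the class name Label and the variable
names are made up for illustration, and it uses the built-in
utf8::is_utf8 rather than Encode::is_utf8. The first check looks at the
reference itself, the second at the string that the overloaded '""'
returns.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical class with overloaded stringification (illustration only).
package Label;
use overload '""' => sub { $_[0]{name} }, fallback => 1;
sub new { my ($class, $name) = @_; return bless { name => $name }, $class }

package main;

my $obj = Label->new("Derviş");

# Inspects the flag on the reference as it is; the overload is not called.
my $flag_on_ref = utf8::is_utf8($obj) ? 1 : 0;

# Interpolation calls the overloaded '""' first, so the flag of the
# resulting string is what gets inspected.
my $flag_on_string = utf8::is_utf8("$obj") ? 1 : 0;

print "flag on ref: $flag_on_ref, flag on string: $flag_on_string\n";
```

The order of the two checks matters: stringifying the object may update
the flag on the reference as a side effect, so the unstringified check
has to come first to show the behaviour reported in the ticket.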

Is there another situation where you can demonstrate stringification
not preserving utf8?


--
David Golden <x...@xdg.me>
Take back your inbox! → http://www.bunchmail.com/
Twitter/IRC: @xdg

Jiří Pavlovský

Jun 6, 2013, 3:08:21 PM
to perlbug-...@perl.org
On 6.6.2013 20:59, David Golden via RT wrote:
> On Thu, Jun 6, 2013 at 5:02 AM, Jiří Pavlovský
> <perlbug-...@perl.org> wrote:
>> is_utf8($t) ? warn "t is utf8" : warn "t is not utf8";
> The issue isn't that stringification doesn't preserve it, but rather
> that utf8::is_utf8 doesn't coerce its argument to a string form.
>
> Try this:
> is_utf8("$t") ? warn "t is utf8" : warn "t is not utf8";
>
> It will say "is utf8".

If I use the object beforehand, it also works:

say $t;
is_utf8($t) ? warn "t is utf8" : warn "t is not utf8";

works

But there are situations where this is not possible.

Can you advise whether this is a bug in utf8 and whether it should be
reported somewhere else? Or is it expected behaviour?

Thank you,

--
Jiří Pavlovský

Father Chrysostomos via RT

Jun 6, 2013, 3:46:52 PM
to perl5-...@perl.org
I’m not sure anybody really knows. :-)

I thought the purpose of this function was for introspection (e.g., in
tests) and it was supposed to give the current value of the utf8 flag.

But it already calls FETCH on tied variables, to ‘update’ the flag. So
maybe it *should* stringify refs and globs (which would also update the
flag). I don’t know....
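
The FETCH behaviour mentioned above can be observed with a tied scalar
that counts its reads. This is a rough sketch, not a test from the perl
source tree: the class CountingScalar and its counter are invented for
the example, and it only shows that calling utf8::is_utf8 on the tied
variable triggers get-magic.

```perl
use strict;
use warnings;

# Hypothetical tie class that counts how often FETCH is called.
package CountingScalar;
my $fetches = 0;
sub TIESCALAR { my ($class, $value) = @_; return bless { value => $value }, $class }
sub FETCH     { my ($self) = @_; $fetches++; return $self->{value} }
sub STORE     { my ($self, $value) = @_; $self->{value} = $value }
sub fetches   { return $fetches }

package main;

# The stored value contains U+015F, so it is a character string.
tie my $tied, 'CountingScalar', "Dervi\x{15F}";

my $before = CountingScalar::fetches();
my $flag   = utf8::is_utf8($tied) ? 1 : 0;   # reads the tied value first
my $after  = CountingScalar::fetches();

print "FETCH calls during is_utf8: ", $after - $before, "\n";
```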

Ultimately, though, this function should rarely be used in production
code, as the flag is merely one of the internal implementation details
(bugs aside).

--

Father Chrysostomos


---
via perlbug: queue: perl5 status: open
https://rt.perl.org:443/rt3/Ticket/Display.html?id=118357

Jiří Pavlovský

Jun 6, 2013, 4:20:11 PM
to perlbug-...@perl.org
On 6.6.2013 21:46, Father Chrysostomos via RT wrote:
> On Thu Jun 06 12:40:50 2013, ji...@pavlovsky.eu wrote:
> I’m not sure anybody really knows. :-)
>
> I thought the purpose of this function was for introspection (e.g., in
> tests) and it was supposed to give the current value of the utf8 flag.

Actually, I'm not that interested in is_utf8. It is just that I
distilled my real-world problem to what seemed to be a problem in
stringification/utf8.

I'm using Wx::Perl. Its combobox takes an array of options, so I pass
an array of objects with a stringification overload.
The combobox displays the stringified values and returns the selected
object. This works great unless the stringified value contains accented
characters: Wx::Perl doesn't see the utf8 flag and won't encode the output.

Now I'm not sure on which doors to knock with this problem.
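
One possible Perl-side workaround, sketched under assumptions: stringify
the objects before they are handed to the widget, keep the objects in a
parallel array, and map the selected index back to the object. The class
Item and the helper item_for_index are hypothetical, and no Wx calls are
shown; how the labels reach the combobox and how the selected index
comes back are left out.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical item class with overloaded stringification.
package Item;
use overload '""' => sub { $_[0]{name} }, fallback => 1;
sub new { my ($class, $name) = @_; return bless { name => $name }, $class }

package main;

my @items = map { Item->new($_) } ('Derviş', 'Jiří', 'plain');

# Stringify up front: each label is an ordinary string whose utf8 flag
# is already settled, so nothing depends on when the overload runs.
my @labels = map { "$_" } @items;

# Map a selection index (as a list widget would report it) back to
# the original object.
sub item_for_index {
    my ($index) = @_;
    return $items[$index];
}

print "labels: ", scalar(@labels), ", second item is a ", ref(item_for_index(1)), "\n";
```

The cost is that the widget no longer holds the objects themselves, so
the index lookup has to stay in sync with the label list.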


--
Jiří Pavlovský

Father Chrysostomos via RT

Jun 6, 2013, 6:56:50 PM
to perl5-...@perl.org
I think this is a bug in Wx::Perl.

I just downloaded Wx-0.9922 from CPAN and did a quick scan.
cpp/helpers.cpp contains this, which I assume is a utility function used
by various parts of Wx::Perl:

#if wxUSE_UNICODE
static wxChar* wxPli_copy_string( SV* scalar, wxChar** )
{
    dTHX;
    STRLEN length;
    wxWCharBuffer tmp = ( SvUTF8( scalar ) ) ?
        wxConvUTF8.cMB2WX( SvPVutf8( scalar, length ) ) :
        wxWCharBuffer( wxString( SvPV( scalar, length ),
                                 wxConvLocal ).wc_str() );

    wxChar* buffer = new wxChar[length + 1];
    memcpy( buffer, tmp.data(), length * sizeof(wxChar) );
    buffer[length] = wxT('\0');
    return buffer;
}
#endif

Checking SvUTF8(scalar) before any stringification is incorrect. What
it should be doing is something like this:

    dTHX;
    STRLEN length;
    char * const s = SvPV( scalar, length );
    wxWCharBuffer tmp = ( SvUTF8( scalar ) ) ?
        wxConvUTF8.cMB2WX( s ) :
        wxWCharBuffer( wxString( s,
                                 wxConvLocal ).wc_str() );

I don’t know what wxConvLocal does, but if it does anything other
than treat the string as Latin-1, then that is also incorrect, and this
would be better:

    dTHX;
    STRLEN length;
    wxWCharBuffer tmp =
        wxConvUTF8.cMB2WX( SvPVutf8( scalar, length ) );


This aspect of SvUTF8 is nothing new; it has been documented since 2006
(commit cd028baaa4):

SvUTF8 Returns a U32 value indicating the UTF-8 status of an SV. If
things are set-up properly, this indicates whether or not the
SV contains UTF-8 encoded data. You should use this after a
call to SvPV() or one of its variants, in case any call to
string overloading updates the internal flag.

(The current wording is of recent provenance and comes from commit
fd1423831.)

I don’t know enough about Wx to write a test case, so could you report
this to bug...@rt.cpan.org?

Now, as for whether utf8::is_utf8 and Encode::is_utf8 (which are
wrappers around SvUTF8, and not SvUTF8 itself, and hence unrelated to
the Wx bug) should be stringifying, it looks to me as though the
original authors intended for that, but failed to implement it that way.
But I fear that people may be relying on the current behaviour. Maybe
it’s not worth opening that can of worms.

Jiří Pavlovský

Jun 7, 2013, 4:24:04 AM
to perlbug-...@perl.org
On 7.6.2013 0:56, Father Chrysostomos via RT wrote:
> I don’t know enough about Wx to write a test case, so could you report
> this to bug...@rt.cpan.org?


I'll do that. And thanks a lot for your help.

--
Jiří Pavlovský
