UTF-8 failures

H.Merijn Brand

Mar 8, 2004, 6:34:20 PM

to rgarci...@free.fr, Abe Timmerman

it's $PERL_UNICODE

lt09:/pro/3gl/CPAN/perl-current/t 105 > ./perl harness ../ext/Fcntl/t/fcntl.t
../ext/Fcntl/t/fcntl....ok
All tests successful.
Files=1, Tests=7, 1 wallclock secs ( 0.02 cusr + 0.01 csys = 0.03 CPU)
lt09:/pro/3gl/CPAN/perl-current/t 106 > ./perl ../ext/Fcntl/t/fcntl.t
1..7
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7 # open /dev/null O_WRONLY
lt09:/pro/3gl/CPAN/perl-current/t 107 > setenv PERL_UNICODE ""
lt09:/pro/3gl/CPAN/perl-current/t 108 > ./perl ../ext/Fcntl/t/fcntl.t
1..7
ok 1
ok 2
Modification of a read-only value attempted at ../ext/Fcntl/t/fcntl.t line 21.
Exit 255
lt09:/pro/3gl/CPAN/perl-current/t 109 > ./perl harness -v ../ext/Fcntl/t/fcntl.t

../ext/Fcntl/t/fcntl....Modification of a read-only value attempted at ../ext/Fcntl/t/fcntl.t line 21.
1..7
ok 1
ok 2
dubious
Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 3-7
Failed 5/7 tests, 28.57% okay
Failed Test Stat Wstat Total Fail Failed List of Failed
-------------------------------------------------------------------------------
../ext/Fcntl/t/fcntl.t 255 65280 7 10 142.86% 3-7
Failed 1/1 test scripts, 0.00% okay. 5/7 subtests failed, 28.57% okay.
Exit 2
lt09:/pro/3gl/CPAN/perl-current/t 110 >

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using perl-5.6.1, 5.8.0, & 5.9.x, and 806 on HP-UX 10.20 & 11.00, 11i,
AIX 4.3, SuSE 8.2, and Win2k. http://www.cmve.net/~merijn/
http://archives.develooper.com/daily...@perl.org/ per...@perl.org
send smoke reports to: smokers...@perl.org, QA: http://qa.perl.org

Rafael Garcia-Suarez

Mar 9, 2004, 3:41:33 AM

to perl5-...@perl.org

H.Merijn Brand wrote in perl.perl5.porters :

> it's $PERL_UNICODE
>
>
> lt09:/pro/3gl/CPAN/perl-current/t 105 > ./perl harness ../ext/Fcntl/t/fcntl.t
> ../ext/Fcntl/t/fcntl....ok
> All tests successful.
> Files=1, Tests=7, 1 wallclock secs ( 0.02 cusr + 0.01 csys = 0.03 CPU)
> lt09:/pro/3gl/CPAN/perl-current/t 106 > ./perl ../ext/Fcntl/t/fcntl.t
> 1..7
> ok 1
> ok 2
> ok 3
> ok 4
> ok 5
> ok 6
> ok 7 # open /dev/null O_WRONLY
> lt09:/pro/3gl/CPAN/perl-current/t 107 > setenv PERL_UNICODE ""

I can't reproduce those failures here.

Note: PERL_UNICODE="" is not equivalent to PERL_UNICODE being unset.
This may be a bug in the smoke suite. As perlrun says, PERL_UNICODE=""
is equivalent to PERL_UNICODE=SDL. PERL_UNICODE unset is equivalent to
PERL_UNICODE=0.
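
The difference between an empty and an unset PERL_UNICODE can be checked from Perl itself. What follows is only a minimal sketch, assuming a perl that provides the ${^UNICODE} variable (which reflects the -C / PERL_UNICODE setting); unicode_flags_for is a hypothetical helper name, not something from the smoke suite.

```perl
use strict;
use warnings;

# Hypothetical helper: run a child perl with PERL_UNICODE set to $setting
# (undef means "unset") and return the child's ${^UNICODE} value.
sub unicode_flags_for {
    my ($setting) = @_;
    local $ENV{PERL_UNICODE};                 # restored when the sub returns
    if (defined $setting) { $ENV{PERL_UNICODE} = $setting }
    else                  { delete $ENV{PERL_UNICODE} }

    # List-form pipe open: no shell is involved, so no quoting problems.
    open(my $child, '-|', $^X, '-e', 'print ${^UNICODE}')
        or die "cannot run $^X: $!";
    my $flags = <$child>;
    close $child;
    return $flags;
}

printf "unset: %s\n", unicode_flags_for(undef);
printf "empty: %s\n", unicode_flags_for("");
printf "zero:  %s\n", unicode_flags_for("0");
```

If perlrun is accurate, the unset and "0" cases should report no flags, while the empty string should report the same non-zero flags as -CSDL.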

Nicholas Clark

Mar 9, 2004, 5:24:40 AM

to Rafael Garcia-Suarez, perl5-...@perl.org

On Tue, Mar 09, 2004 at 08:41:33AM -0000, Rafael Garcia-Suarez wrote:
> H.Merijn Brand wrote in perl.perl5.porters :
> > it's $PERL_UNICODE
> >
> >
> > lt09:/pro/3gl/CPAN/perl-current/t 105 > ./perl harness ../ext/Fcntl/t/fcntl.t
> > ../ext/Fcntl/t/fcntl....ok
> > All tests successful.
> > Files=1, Tests=7, 1 wallclock secs ( 0.02 cusr + 0.01 csys = 0.03 CPU)
> > lt09:/pro/3gl/CPAN/perl-current/t 106 > ./perl ../ext/Fcntl/t/fcntl.t
> > 1..7
> > ok 1
> > ok 2
> > ok 3
> > ok 4
> > ok 5
> > ok 6
> > ok 7 # open /dev/null O_WRONLY
> > lt09:/pro/3gl/CPAN/perl-current/t 107 > setenv PERL_UNICODE ""
>
> I can't reproduce those failures here.
>
> NOTA : PERL_UNICODE="" is not equivalent to PERL_UNICODE being unset.
> This may be a bug in the smoke suite. As perlrun says, PERL_UNICODE=""
> is equivalent to PERL_UNICODE=SDL. PERL_UNICODE unset is equivalent to
> PERL_UNICODE=0.

Core bug in syswrite (and therefore presumably also send) - it's upgrading
the input buffer to utf8 if the output handle is utf8

LC_ALL=nl_NL.UTF-8 ./perl -C63 -Ilib -wle '$a = "\xFF\n"; print "Before" if $a =~ /\w/; syswrite STDOUT,$a; print "After" if $a =~ /\w/'
ÿ
After

Or if the intent is that IO operations upgrade SVs to UTF8 if the file handle
is UTF8, then it's effectively an expectation bug in that SvPVutf8() will now
croak if called on a read only value.

How read only should read only be?

Nicholas Clark

h...@crypt.org

Mar 9, 2004, 6:59:09 AM

to perl5-...@perl.org

Nicholas Clark <ni...@ccl4.org> wrote:
:Core bug in syswrite (and therefore presumably also send) - it's upgrading
:the input buffer to utf8 if the output handle is utf8
:
:LC_ALL=nl_NL.UTF-8 ./perl -C63 -Ilib -wle '$a = "\xFF\n"; print "Before" if $a =~ /\w/; syswrite STDOUT,$a; print "After" if $a =~ /\w/'
:ÿ
:After
:
:Or if the intent is that IO operations upgrade SVs to UTF8 if the file handle
:is UTF8, then it's effectively an expectation bug in that SvPVutf8() will now
:croak if called on a read only value.
:
:How read only should read only be?

Well, we don't croak if you use a read-only C< 2 > in a string context,
so I don't think we should here either.

Does the test give more consistent results if you use /[[:alpha:]]/?

Hugo

Nicholas Clark

Mar 9, 2004, 8:38:04 AM

to h...@crypt.org, perl5-...@perl.org

On Tue, Mar 09, 2004 at 11:59:09AM +0000, h...@crypt.org wrote:
> Nicholas Clark <ni...@ccl4.org> wrote:
> :Core bug in syswrite (and therefore presumably also send) - it's upgrading
> :the input buffer to utf8 if the output handle is utf8
> :
> :LC_ALL=nl_NL.UTF-8 ./perl -C63 -Ilib -wle '$a = "\xFF\n"; print "Before" if $a =~ /\w/; syswrite STDOUT,$a; print "After" if $a =~ /\w/'
> :ÿ
> :After
> :
> :Or if the intent is that IO operations upgrade SVs to UTF8 if the file handle
> :is UTF8, then it's effectively an expectation bug in that SvPVutf8() will now
> :croak if called on a read only value.
> :
> :How read only should read only be?
>
> Well, we don't croak if you use a read-only C< 2 > in a string context,
> so I don't think we should here either.

It seems to be a bug in pp_send (which is used to implement syswrite) where
it fails to make a copy before upgrading to UTF8.

> Does the test give more consistent results if you use /[[:alpha:]]/?

No, syswrite upgrades $_[1] to UTF8, so only the /[[:alpha:]]/ after syswrite
matches.

I think that syswrite should be taking care not to UTF8 upgrade the argument
as a side effect, given that many other parts of perl take measures to avoid
this upgrade (eg sv_eq). That's what I was thinking.
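
The visible effect of such an upgrade can be reproduced without any I/O. This is a minimal sketch, assuming utf8::upgrade() as a stand-in for the in-place upgrade that syswrite performs; it changes only the internal representation, not the characters.

```perl
use strict;
use warnings;

# Assumption: utf8::upgrade() stands in for the side-effect upgrade
# described above; the string keeps the same single character throughout.
my $s = "\xFF";                       # one character, code point 255

my $before = ($s =~ /\w/) ? 1 : 0;    # matched against the byte form
utf8::upgrade($s);                    # same character, now UTF-8 internally
my $after  = ($s =~ /\w/) ? 1 : 0;    # matched against the upgraded form

printf "length=%d before=%d after=%d\n", length($s), $before, $after;
```

On a perl where the unicode_strings feature is not in effect, the first match is expected to fail and the second to succeed, even though the string compares equal to "\xFF" both times.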

Nicholas Clark

Yitzchak Scott-Thoennes

Mar 9, 2004, 10:45:08 AM

to h...@crypt.org, perl5-...@perl.org

On Tue, Mar 09, 2004 at 11:59:09AM +0000, h...@crypt.org wrote:
> Nicholas Clark <ni...@ccl4.org> wrote:

> :How read only should read only be?

I don't like the idea of upgrading a readonly to utf8, but...

> Well, we don't croak if you use a read-only C< 2 > in a string context,
> so I don't think we should here either.

We even upgrade a constant:

$ perl -MDevel::Peek -we'$x = \2; Dump $$x; "$$x" and Dump $$x'
SV = IV(0xa05b250) at 0xa041244
REFCNT = 2
FLAGS = (IOK,READONLY,pIOK)
IV = 2
SV = PVIV(0xa042040) at 0xa041244
REFCNT = 2
FLAGS = (IOK,POK,READONLY,pIOK,pPOK)
IV = 2
PV = 0xa042828 "2"\0
CUR = 1
LEN = 2

That actually surprises me.

Rafael Garcia-Suarez

Mar 9, 2004, 10:53:07 AM

to perl5-...@perl.org

Yitzchak Scott-Thoennes wrote in perl.perl5.porters :

> On Tue, Mar 09, 2004 at 11:59:09AM +0000, h...@crypt.org wrote:
>> Nicholas Clark <ni...@ccl4.org> wrote:
>> :How read only should read only be?
>
> I don't like the idea of upgrading a readonly to utf8, but...
>
>> Well, we don't croak if you use a read-only C< 2 > in a string context,
>> so I don't think we should here either.
>
> We even upgrade a constant:

Well, why not ? the value is constant, but its internal storage
form may change. (at least there is no magic added to it :)

Dave Mitchell

Mar 9, 2004, 10:56:11 AM

to Yitzchak Scott-Thoennes, h...@crypt.org, perl5-...@perl.org

From a philosophical point of view, it could be argued that Perl is free
to change the internal representations of constants to its heart's
content, as long as their external meaning is unchanged. From that
perspective it's okay for a RO value to be upgraded to utf8 (as long as it's
not (SvREADONLY && SvFAKE)). Of course there may be efficiency reasons
for avoiding this where possible.

There may also be externally visible side-effects of this which I
haven't thought of.

--
"Strange women lying in ponds distributing swords is no basis for a system
of government. Supreme executive power derives from a mandate from the
masses, not from some farcical aquatic ceremony."
-- Dennis - Monty Python and the Holy Grail.

Elizabeth Mattijsen

Mar 9, 2004, 11:27:16 AM

to Dave Mitchell, Yitzchak Scott-Thoennes, h...@crypt.org, perl5-...@perl.org

At 15:56 +0000 3/9/04, Dave Mitchell wrote:
>From a philosophical point of view, it could be argued that Perl is free
>to change the internal represtantations of constants to its heart's
>content, as long as its external meaning is unchanged. From that
>perspective its okay for a RO value to be upgraded to utf8 (as long as its
>not (SvREADONLY && SvFAKE). Of course there may be efficiency reasons
>for avoiding this where possible.
>
>There may also be externally visiable side-effects of this which I
>haven't though of.

Since concatenation with a UTF8 string will automatically upgrade the
resulting string, I wonder whether a constant version of
this situation is thinkable:

use Devel::Peek qw(Dump);
use Encode qw(_utf8_on);

my $foo = 'foo';
my $bar = 'bar';
my $baz = $foo.$bar;

Dump $baz; # not upgraded

_utf8_on( $foo );
my $baz = $foo.$bar;

Dump $baz; #upgraded

Now, if $foo were a constant, then I would expect the result of a
concatenation with that constant to always be the same. If however
the constant can be upgraded, the result may _not_ be always the
same. Contrary to expectations one might have with regards
constants... ;-)

Liz

Nick Ing-Simmons

Mar 9, 2004, 11:55:21 AM

to ni...@ccl4.org, perl5-...@perl.org, Rafael Garcia-Suarez

Nicholas Clark <ni...@ccl4.org> writes:
>
>Core bug in syswrite (and therefore presumably also send) - it's upgrading
>the input buffer to utf8 if the output handle is utf8
>

>LC_ALL=nl_NL.UTF-8 ./perl -C63 -Ilib -wle '$a = "\xFF\n"; print "Before" if $a =~ /\w/; syswrite STDOUT,$a; print "After" if $a =~ /\w/'
>ÿ
>After
>
>Or if the intent is that IO operations upgrade SVs to UTF8 if the file handle
>is UTF8,

My original expectation was that UTF8-ness of an SV was an internal detail
that perl core could change at will if it helped. So I expected
code to have to tolerate changes. Simon Cozens took a different view
and I lost the enthusiasm to argue. So now things like :

print "£" for (1..100)

Keep re-encoding to temp, print and free, while I expected 1st time
to upgrade it and rest to use it.

>then it's effectively an expectation bug in that SvPVutf8() will now
>croak if called on a read only value.

Which isn't going to help things like Tk and Encode which need a
UTF-8 sequence.
It shouldn't croak, if mutating the SV is now deemed non-PC then
SvPVutf8 should allocate a new string and use a SAVE_XXX() to free it
on next LEAVE. Or we should deprecate the API (and its twin SvPVbytes
which does the equally modifying downgrade). And tell anyone that to
get UTF-8 one must do

SV *copy = newSVsv(sv);
STRLEN len;
U8 *s;
sv_utf8_upgrade(copy);
s = (U8 *)SvPV(copy, len);
...
SvREFCNT_dec(copy);

Or something even more extreme involving creating a UTF-8 Encode
object and making method calls...
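
The same copy-then-upgrade idea can be sketched at the Perl level. This is only an illustration, assuming utf8::upgrade() and utf8::is_utf8() as the Perl-visible counterparts of the C calls above; it shows that the caller's scalar keeps its representation while the copy carries the UTF-8 form.

```perl
use strict;
use warnings;

my $orig = "\xFF";         # byte string supplied by the caller
my $copy = $orig;          # private copy; the caller's scalar is left alone
utf8::upgrade($copy);      # only the copy changes representation

printf "orig is_utf8=%d copy is_utf8=%d equal=%d\n",
    utf8::is_utf8($orig) ? 1 : 0,
    utf8::is_utf8($copy) ? 1 : 0,
    ($orig eq $copy)     ? 1 : 0;
```

The cost is the extra copy on every call, which is the expense being complained about in this thread.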

>
>How read only should read only be?

How UTF-8 should SvPVutf8 be?

>
>Nicholas Clark

Nick Ing-Simmons

Mar 9, 2004, 11:59:05 AM

to ni...@ccl4.org, h...@crypt.org, perl5-...@perl.org

Nicholas Clark <ni...@ccl4.org> writes:
>
>No, syswrite upgrades $_[1] to UTF8, so only the /[[:alpha:]]/ after syswrite
>matches.
>
>I think that syswrite should be taking care not to UTF8 upgrade the argument
>as a side effect, given that many other parts of perl take measures to avoid
>this upgrade (eg sv_eq).

'twas Simon that started that

Nicholas Clark

Mar 9, 2004, 12:03:03 PM

to Yitzchak Scott-Thoennes, h...@crypt.org, perl5-...@perl.org, Nick Ing-Simmons, Rafael Garcia-Suarez

On Tue, Mar 09, 2004 at 03:56:11PM +0000, Dave Mitchell wrote:

> There may also be externally visiable side-effects of this which I
> haven't though of.

see below. there is one.

On Tue, Mar 09, 2004 at 04:55:21PM +0000, Nick Ing-Simmons wrote:

> My original expectation was that UTF8-ness of an SV was an internal detail
> that perl core could change at will if it helped. So I expected
> code to have to tolerate changes. Simon Cozens took a different view
> and I lost the enthusiasm to argue. So now things like :
>
> print "£" for (1..100)
>
> Keep re-encoding to temp, print and free, while I expected 1st time
> to upgrade it and rest to use it.

It would be entirely an implementation detail if changing the internal
representation didn't affect how /\w/ and /[[:alpha:]]/ matched.

> >How read only should read only be?
>

> How UTF-8 should SvPVutf8 be?

Bugger.

Nicholas Clark

Nick Ing-Simmons

Mar 9, 2004, 12:11:17 PM

to da...@fdisolutions.com, h...@crypt.org, perl5-...@perl.org, Yitzchak Scott-Thoennes

Dave Mitchell <da...@fdisolutions.com> writes:
>
>From a philosophical point of view, it could be argued that Perl is free
>to change the internal represtantations of constants to its heart's
>content, as long as its external meaning is unchanged. From that
>perspective its okay for a RO value to be upgraded to utf8 (as long as its
>not (SvREADONLY && SvFAKE). Of course there may be efficiency reasons
>for avoiding this where possible.

There may also be efficiency reasons for doing it ;-)

Nick Ing-Simmons

Mar 9, 2004, 12:24:37 PM

to l...@dijkmat.nl, Dave Mitchell, h...@crypt.org, perl5-...@perl.org, Yitzchak Scott-Thoennes

Elizabeth Mattijsen <l...@dijkmat.nl> writes:
>>There may also be externally visiable side-effects of this which I
>>haven't though of.
>
>Since concatenation with a UTF8 string will automatically upgrade the
>resulting string. I wonder whether there is a constant version of
>this situation thinkable:
>
>use Devel::Peek qw(Dump);
>use Encode qw(_utf8_on);
>
>my $foo = 'foo';
>my $bar = 'bar';
>my $baz = $foo.$bar;
>
>Dump $baz; # not upgraded
>
>
>_utf8_on( $foo );
>my $baz = $foo.$bar;
>
>Dump $baz; #upgraded
>
>
>Now, if $foo were a constant, then I would expect the result of a
>contatenation with that constant to always be the same.

Using Dump() is cheating - it is looking at internal state.
It should result in same sequence of characters either way,
but internal representation might be different.

Elizabeth Mattijsen

Mar 9, 2004, 12:30:17 PM

to Nick Ing-Simmons, Dave Mitchell, h...@crypt.org, perl5-...@perl.org, Yitzchak Scott-Thoennes

At 17:24 +0000 3/9/04, Nick Ing-Simmons wrote:

>Elizabeth Mattijsen <l...@dijkmat.nl> writes:
> >Now, if $foo were a constant, then I would expect the result of a
>>contatenation with that constant to always be the same.
>
>Using Dump() is cheating - it is looking at internal state.
>It should result in same sequence of characters either way,
>but internal representation might be different.

Yes, but if the internal representation is different, the least you
may find are execution speed differences for regexes on the
concatenated string.

Liz

Nick Ing-Simmons

Mar 9, 2004, 12:31:04 PM

to ni...@ccl4.org, Nick Ing-Simmons, h...@crypt.org, Rafael Garcia-Suarez, perl5-...@perl.org, Yitzchak Scott-Thoennes

Nicholas Clark <ni...@ccl4.org> writes:
>
>It would be entirely an implementation detail if changing the internal
>representation didn't effect how /\w/ and /[[:alpha:]]/ matched.

Bugger -
presumably
SvUTF8 -> Unicode rules
!SvUTF8 -> locale rules

I thought you had to 'use locale' to get local-oid rules, and without
it we use iso-8859-1 which is supposed to be a strict subset of Unicode?

Nick Ing-Simmons

Mar 9, 2004, 12:36:37 PM

to l...@dijkmat.nl, Dave Mitchell, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Yitzchak Scott-Thoennes

Well if we could mutate things then regex would be at liberty
to convert it if it knew it would help ;-)

We need a "policy" on this one way or the other.

>
>
>Liz

Yitzchak Scott-Thoennes

Mar 9, 2004, 12:42:26 PM

to Nick Ing-Simmons, ni...@ccl4.org, h...@crypt.org, Rafael Garcia-Suarez, perl5-...@perl.org

On Tue, Mar 09, 2004 at 05:31:04PM +0000, Nick Ing-Simmons <nick.ing...@elixent.com> wrote:
> Nicholas Clark <ni...@ccl4.org> writes:
> >
> >It would be entirely an implementation detail if changing the internal
> >representation didn't effect how /\w/ and /[[:alpha:]]/ matched.
>
> Bugger -
> presumably
> SvUTF8 -> Unicode rules
> !SvUTF8 -> locale rules
>
> I thought you had to 'use locale' to get local-oid rules, and without
> it we use iso-8859-1 which is supposed to be a strict subset of Unicode?

For !SvUTF8 and no locale, you get the C locale, which is more limited
than iso-8859-1.

Nicholas Clark

Mar 9, 2004, 1:06:15 PM

to Yitzchak Scott-Thoennes, Nick Ing-Simmons, h...@crypt.org, Rafael Garcia-Suarez, perl5-...@perl.org

Which means that our specific "problem" (in as much as it's pure perl
visible) is that the interpretation of bytes er characters er bytes in
the range 160 to 255 changes depending on whether the SV happens to internally
be marked as utf8 or not. The internals have only one flag, which is being
used both to represent encoding, and character set (er, whatever)

I'm told that some CPAN modules misbehave if you pass them data in utf8 -
ie least surprise is to keep 8 bit data as 8 bit.

Something I'm wondering is if we can cache the other representation somewhere
so that the 2utf8 and 2bytes functions can return it. In theory all of the
cache reset for the UTF8 side should already exist because of Jarkko's offset
caching code.

But I'm thinking out loud here.

Nicholas Clark

Gisle Aas

Mar 9, 2004, 2:07:00 PM

to Nick Ing-Simmons, ni...@ccl4.org, perl5-...@perl.org, Rafael Garcia-Suarez

Nick Ing-Simmons <nick.ing...@elixent.com> writes:

> My original expectation was that UTF8-ness of an SV was an internal detail
> that perl core could change at will if it helped. So I expected
> code to have to tolerate changes. Simon Cozens took a different view
> and I lost the enthusiasm to argue.

I'm totally with Nick on this. Seems like Larry agrees too. Quoting a
recent message from Larry [1]:

| If I recall correctly, it's because the pumpking of the time thought
| that backward compatibility was more important than consistency,
| and gave the internal 8-bit representation different semantics than
| the corresponding utf8 representation. I think this was a mistake,
| personally.

My main wish for perl would be to have the byte semantics just go
away. It makes it hard to explain how strings work and creates a lot of
complications for the implementation. EBCDIC compatibility is better
handled at the edge, so that we can have a single and simple model for
strings internally.

If that direction is not acceptable, I would try to find another flag
bit that can be used to explicitly encode byte semantic. That way we
can actually make utf8_{upgrade,downgrade} semantic noops. The
current solution with short lived temp copies all over the place seems
awfully expensive.

Regards,
Gisle

[1] http://www.nntp.perl.org/group/perl.unicode/2431

Nick Ing-Simmons

Mar 9, 2004, 2:18:07 PM

to stho...@efn.org, ni...@ccl4.org, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Rafael Garcia-Suarez

How catastrophic would it be to change that - after all only
Americans would notice ;-)

h...@crypt.org

Mar 9, 2004, 2:46:22 PM

to perl5-...@perl.org

Gisle Aas <gi...@ActiveState.com> wrote:

:Nick Ing-Simmons <nick.ing...@elixent.com> writes:
:
:> My original expectation was that UTF8-ness of an SV was an internal detail
:> that perl core could change at will if it helped. So I expected
:> code to have to tolerate changes. Simon Cozens took a different view
:> and I lost the enthusiasm to argue.
:
:I'm totally with Nick on this. Seems like Larry agrees too. Quoting a
:recent message from Larry [1]:
:
:| If I recall correctly, it's because the pumpking of the time thought
:| that backward compatibility was more important than consistency,
:| and gave the internal 8-bit representation different semantics than
:| the corresponding utf8 representation. I think this was a mistake,
:| personally.

In principle I would have no problem with mandating that utf8ness is
an attribute of the internal form, and that the programmer should
never need to know or care what the internal form of a string might
be at any point in a program.

In practice, there are a small number of corner cases (such as /\w/)
that cause a problem, and we'd need to have a consistent definition
of them - ideally a definition that permitted backward compatibility.

I suspect that there is enough code out there relying on pre-Unicode
semantics that it is no longer practical to hope to resolve this within
the perl5 track, and it may be therefore that you will have to wait for
perl6 to get the true benefits of "Unicode everywhere".

:My main wish for perl would be to have the byte semantics to just go
:away. It makes it hard to explain how strings work and creates lot of
:complications for the implementation. EBCDIC compatibility is better
:handled at the edge, so that we can have a single and simple model for
:strings internally.

I don't fully understand what you are suggesting here - are you talking
about removing the distinction in internal storage? Or only about
changing what is exposed to the programmer?

Hugo

Nick Ing-Simmons

Mar 9, 2004, 3:35:20 PM

to h...@crypt.org, perl5-...@perl.org

<h...@crypt.org> writes:
>
>In practice, there are a small number of corner cases (such as /\w/)
>that cause a problem, and we'd need to have a consistent definition
>of them - ideally a definition that permitted backward compatibility.

So if /\w/ is forever-ASCII in bytes world why isn't it the same
in Unicode world? We have \p{} to give "word-ish" classes
in Unicode, so cannot \w mean [A-Za-z0-9_] forever?

The point is to be consistent.

>
>I suspect that there is enough code out there relying on pre-Unicode
>semantics that it is no longer practical to hope to resolve this within
>the perl5 track, and it may be therefore that you will have to wait for
>perl6 to get the true benefits of "Unicode everywhere".

Counter to what I argue above - wouldn't saying that \w has Unicode
semantics (via pragma) ease transition to perl6?

>
>:My main wish for perl would be to have the byte semantics to just go
>:away. It makes it hard to explain how strings work and creates lot of
>:complications for the implementation. EBCDIC compatibility is better
>:handled at the edge, so that we can have a single and simple model for
>:strings internally.
>
>I don't fully understand what you are suggesting here - are you talking
>about removing the distinction in internal storage? Or only about
>changing what is exposed to the programmer?

Well I am saying the programmer

- perl programmer should not need to care.
e.g. No sudden surprises that ñ is now allowed in words
because the string went near some UTF-8.

- XS/C programmer should be able to consistently get bytes or UTF-8
as they request.
e.g. no sudden croaks because someone passed in a constant,
or a substr()

The dual scheme is well entrenched in perl5 now.

Yitzchak Scott-Thoennes

Mar 9, 2004, 4:17:18 PM

to Nick Ing-Simmons, ni...@ccl4.org, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, Rafael Garcia-Suarez

I'd love it. cygwin only has support for the C locale, so to get regexes
to use latin1 I *have* to upgrade strings to utf8 currently.

The POSIX:: functions continue to default to C locale, but otherwise have
[ul]c(first)? and regexes default to latin1.

Rafael Garcia-Suarez

Mar 9, 2004, 4:20:46 PM

to Yitzchak Scott-Thoennes, perl5-...@perl.org

Maybe with PERL_UNICODE=63 you could force a few strings to UTF8 :)

Chip Salzenberg

Mar 9, 2004, 4:30:45 PM

to Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org

I have a design question.

Conversion from bytes to utf8 requires assuming a locale. In the
absence of C<use locale>, it's clear to me that the locale that should
be assumed is C, not Latin-1.

Given that, why is it not *illegal* to convert bytes with the high bit
set to UTF8 automatically? (Except under C<use locale> or some other
indication that the user wants the current locale to take effect.)

If such a conversion were illegal when attempted automatically, we
wouldn't be having this conversation, would we?
--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>
"I wanted to play hopscotch with the impenetrable mystery of existence,
but he stepped in a wormhole and had to go in early." // MST3K

Nick Ing-Simmons

Mar 10, 2004, 4:05:02 AM

to ch...@pobox.com, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org

Chip Salzenberg <ch...@pobox.com> writes:
>I have a design question.
>
>Conversion from bytes to utf8 requires assuming a locale. In the
>absence of C<use locale>, it's clear to me that the locale that should
>be assumed is C, not Latin-1.

Why?

Chip Salzenberg

Mar 10, 2004, 10:29:33 AM

to Nick Ing-Simmons, h...@crypt.org, Nick Ing-Simmons, perl5-...@perl.org

According to Nick Ing-Simmons:

> Chip Salzenberg <ch...@pobox.com> writes:
> >Conversion from bytes to utf8 requires assuming a locale. In the
> >absence of C<use locale>, it's clear to me that the locale that should
> >be assumed is C, not Latin-1.
>
> Why?

Perl uses the C locale for bytes, and always has. It wasn't until
5.004 that Perl was even *capable* of doing otherwise (C<use locale>).

The bottom 256 codes of Unicode match Latin-1. When you take a byte
in the range of 128-255 and start calling it Unicode, you're implying
that the string was actually Latin-1, and that's just wrong.

Conversions should assume C locale, which makes bytes in the range
128-255 illegal for UTF-8 conversion.

Consider:

$nn = "\xD1"; # N-with-tilde, if it were Latin-1
die if $nn =~ /\w/; # nope, we're in the C locale, that's not \w

That's fine. But now let's convert $nn to UTF-8:

$nn = "\xD1"; # N-with-tilde, if it were Latin-1
$foo = $nn . "\x{0100}";
die if substr($foo,0,1) =~ /\w/; # Earth-shattering KABOOM!

That's just wrong. I didn't say C<use locale>. Besides, even if I
had, my actual locale may not be compatible with Latin-1! What if
my $LANG were 'Zh_cn'?

It gets worse if there are *any* circumstances in which $nn might be
converted to UTF-8 _in-place_. Please tell me there aren't.

ni...@ing-simmons.net

Mar 10, 2004, 11:46:27 AM

to ch...@pobox.com, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org

Chip Salzenberg <ch...@pobox.com> writes:
>According to Nick Ing-Simmons:
>> Chip Salzenberg <ch...@pobox.com> writes:
>> >Conversion from bytes to utf8 requires assuming a locale. In the
>> >absence of C<use locale>, it's clear to me that the locale that should
>> >be assumed is C, not Latin-1.
>>
>> Why?
>
>Perl uses the C locale for bytes, and always has. It wasn't until
>5.004 that Perl was even *capable* of doing otherwise (C<use locale>).
>
>The bottom 256 codes of Unicode match Latin-1. When you take a byte
>in the range of 128-255 and start calling it Unicode, you're implying
>that the string was actually Latin-1, and that's just wrong.

This however is what I am devil's-advocating - if we said
in absence of locale we assumed Latin-1 what is the breakage?

Or if (as your \xD1 example suggests) that would break things,
suppose none of U+0080..U+00FF matched \w - i.e. we apply C locale
rules to things that happen to be in Unicode too.
Unless you did 'use unicode' akin to 'use locale'?
(_perhaps_ give this semantic to 'use utf8' but I don't
like that as it is already much mis-used)

The snag here is need to handle case where script is written
in (say) a Japanese locale, but still wants Unicode semantics of \w
so one might need a 'ctype' pragma:

#!
use encoding 'latin-1'; # script author used literal ñ etc.

sub parse_command_line
{
use ctype 'locale';
}

sub network
{
my $msg = <$socket>;
use ctype 'unicode'; # modern semantics wanted...

}

You are arguing we need at least 3 ctypes:
C - for 'ñ' !~ /\w/
Locale - mine and José's do yours doesn't
Unicode - it does and so do lots of other things...

But I believe Camel-III has said that \w means something for Unicode.

>
>Conversions should assume C locale, which makes bytes in the range
>128-255 illegal for UTF-8 conversion.
>
>Consider:
>
> $nn = "\xD1"; # N-with-tilde, if it were Latin-1
> die if $nn =~ /\w/; # nope, we're in the C locale, that's not \w
>
>That's fine. But now let's convert $nn to UTF-8:
>
> $nn = "\xD1"; # N-with-tilde, if it were Latin-1
> $foo = $nn . "\x{0100}";
> die if substr($foo,0,1) =~ /\w/; # Earth-shattering KABOOM!
>
>That's just wrong.

I agree that them differing is wrong.

>I didn't say C<use locale>. Besides, even if I
>had, my actual locale may not be compatible with Latin-1! What if
>my $LANG were 'Zh_cn'?
>
>It gets worse if there are *any* circumstances in which $nn might be
>converted to UTF-8 _in-place_. Please tell me there aren't.

This is the current snag - there are spots in core perl which do that.

$nn .= "\x{100}";

for a start but that is visible.

socket(my $fh,...)
binmode $fh,":encoding(...)";
...
send($fh,$nn);

action at a distance.

use Tk;
...
my $window = MainWindow->new(-title => $nn);

XS code ...

We have public API:

SvPVutf8(SV *sv,len)

that XS like Tk that wants UTF-8 has been told it can use.
It upgrades sv in place and returns SvPV.
(Well it tries - sometimes it doesn't and gives
you non-encoded bytes - but that is a different story.)

Chip Salzenberg

Mar 10, 2004, 12:00:59 PM

to ni...@ing-simmons.net, h...@crypt.org, perl5-...@perl.org

According to ni...@ing-simmons.net:

> Chip Salzenberg <ch...@pobox.com> writes:
> >The bottom 256 codes of Unicode match Latin-1. When you take a byte
> >in the range of 128-255 and start calling it Unicode, you're implying
> >that the string was actually Latin-1, and that's just wrong.
>
> This however is what I am devil's-advocating - if we said
> in absence of locale we assumed Latin-1 what is the breakage?

I already demonstrated it. Advocating for the Devil is fine, but at
some point you're gonna have to plea bargain.

> Or if (as your \xD1 example suggests) that would break things,
> suppose none of U+0080..U+00FF matched \w - i.e. we apply C locale
> rules things that happen to be in Unicode too.

But that's worse (if such a thing is possible). A UTF-8 string coming
from outside that has \xD1 characters in it had better "Bob"damned
well match /\w/ or the users will lynch us, and rightly so.

> You are arguing we need at least 3 ctypes:
> C - for 'ñ' !~ /\w/
> Locale - mine and José's do yours doesn't
> Unicode - it does and so do lots of other things...

No. Locale is inapplicable to Unicode (at least WRT /\w/).

1. The meaning of UTF-8 strings is locale independent.

2. The meaning of byte strings is locale dependent.

a. By default the locale "C" is assumed, for reasons of history
and efficiency.

b. Under C<use locale>, the C library locale is used (which
usually follows environment variables $LANG, $LC_ALL, etc.
but that's nothing Perl can know for sure)

3. Conversion from byte strings to UTF-8 simply *CANNOT* be done
without assuming a locale. This is a technical fact.

4. Currently Perl converts byte codes to Unicode codes without
translation. This is an *IMPLICIT ASSUMPTION* of Latin-1.

5. Assuming Latin-1 is incompatible with Perl's historical assumption
of "C" locale.

a. It breaks code that depends on Perl's "C" default.

b. Detecting this breakage can be difficult.

6. This is a SECURITY ISSUE. Screwing with /\w/ opens security
holes.

a. Whenever we tell people to use -T, we're also telling them
that regexes are the way to guarantee that data are safe.

b. Yet a UTF-8 conversion makes /\w/ match characters that it
would otherwise not match.

c. When those other characters are passed to child processes
or used to create files, what will happen?

d. WE CANNOT KNOW what will happen.

e. We better get ready to be lynched unless we fix this.
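Points 4 and 6b can be reproduced with a few lines of Perl. This is only a sketch, assuming default regex semantics (no C<use locale>, no C<use feature 'unicode_strings'>, no /a or /u modifier); the describe() helper is made up for illustration.

```perl
use strict;
use warnings;

# Summarise how a string is stored and whether it matches /\w/.
sub describe {
    my ($s) = @_;    # the copy keeps the original's internal representation
    return sprintf "utf8=%d word=%d",
        utf8::is_utf8($s) ? 1 : 0,
        $s =~ /\w/        ? 1 : 0;
}

my $n_tilde = "\xD1";               # one byte, no UTF-8 flag
my $before  = describe($n_tilde);

utf8::upgrade($n_tilde);            # same character, different storage
my $after   = describe($n_tilde);

print "before upgrade: $before\n";  # utf8=0 word=0
print "after upgrade:  $after\n";   # utf8=1 word=1
print "still eq \"\\xD1\"\n" if $n_tilde eq "\xD1";
```

The string compares equal to "\xD1" both before and after the upgrade, yet /\w/ gives two different answers for it.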

> >It gets worse if there are *any* circumstances in which $nn might be
> >converted to UTF-8 _in-place_. Please tell me there aren't.
>
> This is the current snag - there are spots in core perl which do that.

<a href="chip_screaming.mp3>Great.</a>

H.Merijn Brand

Mar 10, 2004, 1:34:16 PM

to Chip Salzenberg, Perl 5 Porters

On Wed 10 Mar 2004 18:00, Chip Salzenberg <ch...@pobox.com> wrote:
> > >It gets worse if there are *any* circumstances in which $nn might be
> > >converted to UTF-8 _in-place_. Please tell me there aren't.
> >
> > This is the current snag - there are spots in core perl which do that.
>
> <a href="chip_screaming.mp3>Great.</a>

s/(?<=3)/\N{QUOTATION MARK}/;

Who is sorry for opening this can of 0x80 worms

I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
0x20..0xFE to be iso-8859-1 and everything else to be UTF-8. I know that this
is pretty shortsighted, but it keeps me alive and healthy.

perl -C2

will make my output do UTF-8, and

use charnames ":full", ":alias" => ...

will make my life easy

I don't expect \w to match á or ø or any diacritical mark outside of the
iso-8859-1. For me \w being [A-Za-z_\d] is FINE. And I don't expect more in
perl5. If I do, I will use encoding and such.

Just my € 0.02

--
H.Merijn Brand Amsterdam Perl Mongers (http://amsterdam.pm.org/)
using perl-5.6.1, 5.8.0, & 5.9.x, and 806 on HP-UX 10.20 & 11.00, 11i,
AIX 4.3, SuSE 8.2, and Win2k. http://www.cmve.net/~merijn/
http://archives.develooper.com/daily...@perl.org/ per...@perl.org
send smoke reports to: smokers...@perl.org, QA: http://qa.perl.org

Nicholas Clark

Mar 10, 2004, 1:50:11 PM

to Rafael Garcia-Suarez, Nick Ing-Simmons, perl5-...@perl.org

On Tue, Mar 09, 2004 at 04:55:21PM +0000, Nick Ing-Simmons wrote:
> Nicholas Clark <ni...@ccl4.org> writes:

> >then it's effectively an expectation bug in that SvPVutf8() will now
> >croak if called on a read only value.
>
> Which isn't going to help things like Tk and Encode which need a
> UTF-8 sequence.
> It shouldn't croak, if mutating the SV is now deemed non-PC then
> SvPVutf8 should allocate a new string and use a SAVE_XXX() to free it
> on next LEAVE. Or we should deprecate the API (and its twin SvPVbytes
> which does the equally modifying downgrade). And tell anyone that to
> get UTF-8 one must do
>
> SV *copy = newSVsv(sv);
> U8 *s;
> sv_utf8_upgrade(copy);
> s = SvPV(copy);
> ...
> SvREFCNT_dec(copy);
>
> Or something even more extreme involving creating a UTF-8 Encode
> object and making method calls...

For the moment, should I apply this:

==== //depot/perl/sv.c#725 - /home/nick/p4perl/perl/sv.c ====
@@ -3477,10 +3477,6 @@
sv_force_normal_flags(sv, 0);
}

- if (SvREADONLY(sv)) {
- Perl_croak(aTHX_ PL_no_modify);
- }
-
if (PL_encoding && !(flags & SV_UTF8_NO_ENCODING))
sv_recode_to_utf8(sv, PL_encoding);
else { /* Assume Latin-1/EBCDIC */

Which will revert things so that you can upgrade a string constant from
bytes to utf8. (which is how things were before I started faffing)
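At the Perl level, the "copy first, then convert" idea from the quoted C snippet might look like the sketch below. utf8_octets_of() is a hypothetical helper, not a core API, and it still treats high bytes as Latin-1, which is exactly the assumption being argued about in this thread.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not a core API): UTF-8 octets of a string,
# computed on a private copy so the caller's scalar is never modified.
sub utf8_octets_of {
    my ($str) = @_;                    # copy; constants and literals are safe
    return Encode::encode_utf8($str);  # high bytes are treated as Latin-1
}

my $label  = "\xD1";
my $octets = utf8_octets_of($label);
my $hex    = join ' ', map { sprintf '%02X', ord } split //, $octets;

print "octets: $hex\n";                                                    # C3 91
print "original upgraded? ", (utf8::is_utf8($label) ? "yes" : "no"), "\n"; # no
print "literal length: ", length(utf8_octets_of("Quit")), "\n";            # 4
```

Because the conversion happens on a copy, passing a read-only literal such as 'Quit' cannot trigger "Modification of a read-only value attempted".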

Nicholas Clark

Chip Salzenberg

Mar 10, 2004, 2:10:25 PM

to H.Merijn Brand, Perl 5 Porters

According to H.Merijn Brand:

Statement 1:

> I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
> 0x20..0xFE to be iso-8859-1 and everything else to be UTF-8.

Statement 2:

> I don't expect \w to match á or ø or any diacritical mark outside of the
> iso-8859-1.

Statement 3:

> For me \w being [A-Za-z_\d] is FINE. And I don't expect more in perl5.

Statements 1, 2, and 3 are at least partially contradictory.
For example, \w of [A-Za-z_\d] is "C" locale, not 8859-1.

Rafael Garcia-Suarez

Mar 10, 2004, 4:11:57 PM

to Nicholas Clark, Nick Ing-Simmons, perl5-...@perl.org

Nicholas Clark wrote:
>
> For the moment, should I apply this:

I think yes.
Apparently we need to think more about all this stuff.
In principle it should be ok to upgrade a constant to UTF8; after
all, that's only an internal representation. However, as you
pointed out, this conversion depends on external factors and has
external effects. And as Chip pointed out the current conversion
does not do the right thing for upper bytes. IIUC if we convert
properly bytes to UTF8 wrt the C locale (unless the user specified
otherwise) this problem goes away.

Chip Salzenberg

Mar 10, 2004, 6:14:47 PM

to Rafael Garcia-Suarez, Nicholas Clark, Nick Ing-Simmons, perl5-...@perl.org

According to Rafael Garcia-Suarez:
> As Chip pointed out the current conversion does not do the right thing
> for upper bytes. IIUC if we convert properly bytes to UTF8 wrt the C
> locale (unless the user specified otherwise) this problem goes away.

Constants are supposed to be, well, constant; but the lexical locale
settings and dynamic $LANG settings are not. So upgrading a constant
is fraught with danger anyway.

Even harsher is the reality that there *is* no legal conversion of
high-bit characters to UTF-8 in the C locale, because the C locale
fails to provide such characters any specific meaning.

Conversion of high-bit characters to UTF-8 in the C locale should
never happen automatically, and when requested by the user they should
fail. Or so it appears to me.

> Which will revert things so that you can upgrade a string constant from
> bytes to utf8.

Er, this has to be bad.

Rafael Garcia-Suarez

Mar 10, 2004, 6:32:23 PM

to Chip Salzenberg, Nicholas Clark, Nick Ing-Simmons, perl5-...@perl.org

Chip Salzenberg wrote:
>
> According to Rafael Garcia-Suarez:
> > As Chip pointed out the current conversion does not do the right thing
> > for upper bytes. IIUC if we convert properly bytes to UTF8 wrt the C
> > locale (unless the user specified otherwise) this problem goes away.
>
> Constants are supposed to be, well, constant; but the lexical locale
> settings and dynamic $LANG settings are not. So upgrading a constant
> is fraught with danger anyway.

Right. So temporary copies are needed. And an API for XS programmers
who will need to deal with it.

(Perhaps we should introduce a new flag, "this PV is binary data, never
upgrade it to UTF8", for the new strict behaviour, versus the old lax
behaviour "this PV is upgradable to UTF8 (and if it contains
high-bit chars they'll be treated as Latin-1 and you're on your own)".)

> Even harsher is the reality that there *is* no legal conversion of
> high-bit characters to UTF-8 in the C locale, because the C locale
> fails to provide such characters any specific meaning.

That's horribly correct.

> Conversion of high-bit characters to UTF-8 in the C locale should
> never happen automatically, and when requested by the user they should
> fail. Or so it appears to me.

That's quite a big change. Probably not suitable for maint.

> > Which will revert things so that you can upgrade a string constant from
> > bytes to utf8.
>
> Er, this has to be bad.

In this case I prefer old bugs to new ones.

H.Merijn Brand

Mar 11, 2004, 3:06:34 AM

to Chip Salzenberg, Perl 5 Porters

On Wed 10 Mar 2004 20:10, Chip Salzenberg <ch...@pobox.com> wrote:
> According to H.Merijn Brand:
>
> Statement 1:
>
> > I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
> > 0x20..0xFE to be iso-8859-1 and everything else to be UTF-8.
>
> Statement 2:
>
> > I don't expect \w to match á or ø or any diacritical mark outside of the
> > iso-8859-1.
>
> Statement 3:
>
> > For me \w being [A-Za-z_\d] is FINE. And I don't expect more in perl5.
>
> Statements 1, 2, and 3 are at least partially contradictory.
> For example, \w of [A-Za-z_\d] is "C" locale, not 8859-1.

Then I was not verbose enough. \w is currently well documented to say what it
matches. I don't expect to match beyond that, not even in the future (of perl5)

If - by chance - \w would match á (a-acute), that would probably more surprise
me than I would expect it.

I still see no contradiction.

My statements are about how I perceive things. My expectations, how I /see/
things. How I /expect/ things. They are not about the truth. Expectations are
often broken, and I have to use other ways to get where I want to go, so far
this has not changed my view.

Nick Ing-Simmons

Mar 11, 2004, 5:42:32 AM

to h.m....@hccnet.nl, Chip Salzenberg, Perl 5 Porters

H.Merijn Brand <h.m....@hccnet.nl> writes:
>
>Who is sorry for opening this can of 0x80 worms

I'm not sorry you did.

>
>
>I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
>0x20..0xFE to be iso-8859-1 and everything else to be UTF-8. I know that this
>is pretty shortsighted, but it keeps me alive and healthy.
>
>perl -C2
>
>will make my output do UTF-8, and
>
>use charnames ":full", ":alias" => ...
>
>will make my life easy
>
>I don't expect \w to match á or ø or any diacritical mark outside of the
>iso-8859-1. For me \w being [A-Za-z_\d] is FINE. And I don't expect more in
>perl5. If I do, I will use encoding and such.

I agree with all that (in a shared iso-8859-1 view).
I wrote \w as [A-Za-z_0-9] as I have a feeling \d matches
other (arabic etc.) So digits. "١٢٠" i.e. "\x{0661}\x{0662}\x{0660}"
(120) would match \d+.

If I have a worry about "words" it is that \b will be less intuitive
if we stick with C locale word chars.
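The hunch about \d is easy to check. A small sketch, using the same three code points as above and assuming default regex semantics (no /a modifier):

```perl
use strict;
use warnings;

# ARABIC-INDIC DIGITS ONE, TWO, ZERO
my $arabic_indic = "\x{0661}\x{0662}\x{0660}";

my $matches_d     = $arabic_indic =~ /\A\d+\z/    ? 1 : 0;
my $matches_ascii = $arabic_indic =~ /\A[0-9]+\z/ ? 1 : 0;

print "\\d+    : $matches_d\n";       # 1
print "[0-9]+ : $matches_ascii\n";    # 0
```

So code that wants ASCII digits only has to spell out [0-9] rather than rely on \d.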

>
>Just my € 0.02

H.Merijn Brand

Mar 11, 2004, 5:51:59 AM

to Nick Ing-Simmons, Perl 5 Porters

On Thu 11 Mar 2004 11:42, Nick Ing-Simmons <nick.ing...@elixent.com> wrote:
> H.Merijn Brand <h.m....@hccnet.nl> writes:
> >
> >Who is sorry for opening this can of 0x80 worms
>
> I'm not sorry you did.
>
> >
> >
> >I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
> >0x20..0xFE to be iso-8859-1 and everything else to be UTF-8. I know that this
> >is pretty shortsighted, but it keeps me alive and healthy.
> >
> >perl -C2
> >
> >will make my output do UTF-8, and
> >
> >use charnames ":full", ":alias" => ...
> >
> >will make my life easy
> >
> >I don't expect \w to match á or ø or any diacritical mark outside of the
> >iso-8859-1. For me \w being [A-Za-z_\d] is FINE. And I don't expect more in
> >perl5. If I do, I will use encoding and such.
>
> I agree with all that (in a shared iso-8859-1 view).
> I wrote \w as [A-Za-z_0-9] as I have a feeling \d matches
> other (arabic etc.) So digits. "١٢٠" i.e. "\x{0661}\x{0662}\x{0660}"
> (120) would match \d+.

I will change /my/ view in this to match yours. I agree that \b (now defined
as m{ (?<=\W|^) (?=\w) | (?<=\w) (?=\W|$) }x if I'm not mistaken) and \d
could be a lot more difficult

> If I have a worry about "words" it is that \b will be less intuitive
> if we stick with C locale word chars.

Ouch. Right.
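Since \b is defined in terms of \w, word splitting shifts with the representation too. A sketch, assuming default regex semantics and a made-up word_runs() helper:

```perl
use strict;
use warnings;

# Return the runs of \w characters found in a string.
sub word_runs {
    my ($text) = @_;
    return [ $text =~ /(\w+)/g ];
}

my $bytes = "ma\xF1ana";        # "mañana" as Latin-1 bytes
my $chars = $bytes;
utf8::upgrade($chars);          # same text, stored as UTF-8

my $byte_runs = word_runs($bytes);
my $char_runs = word_runs($chars);

printf "bytes: %d run(s)\n", scalar @$byte_runs;   # 2 ("ma" and "ana")
printf "utf8:  %d run(s)\n", scalar @$char_runs;   # 1
```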

Nick Ing-Simmons

Mar 11, 2004, 5:52:13 AM

to ch...@pobox.com, H.Merijn Brand, Perl 5 Porters

Chip Salzenberg <ch...@pobox.com> writes:
>According to H.Merijn Brand:
>
>Statement 1:
>
>> I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
>> 0x20..0xFE to be iso-8859-1 and everything else to be UTF-8.
>
>Statement 2:
>
>> I don't expect \w to match á or ø or any diacritical mark outside of the
>> iso-8859-1.
>
>Statement 3:
>
>> For me \w being [A-Za-z_\d] is FINE. And I don't expect more in perl5.
>
>Statements 1, 2, and 3 are at least partially contradictory.

So is the fact that

$a = "ñ";
print "no" unless $a =~ /^\w/;   # byte string: ñ is not a word character
$a .= chr(0x600);                # forces an upgrade to UTF-8
print "yes" if $a =~ /^\w/;      # now ñ matches \w

>For example, \w of [A-Za-z_\d] is "C" locale, not 8859-1.

And Merijn and I at least don't mind that.
i.e. appending above would not make ñ a word char.

\w is mainly used for
programming language identifier names, file paths etc. which
are mostly defined in a C locale way. For natural language words
\w alone is "too weak" in that one would want accents, in _my_
case certainly '-' but possibly also (e.g. "e.g.") '.' but perhaps
not 0-9 and almost certainly not '_'.
So I tend not to use \w when matching kind of "words" that locale
affects.

Nick Ing-Simmons

Mar 11, 2004, 6:00:48 AM

to ch...@pobox.com, Nicholas Clark, Nick Ing-Simmons, perl5-...@perl.org, Rafael Garcia-Suarez

Chip Salzenberg <ch...@pobox.com> writes:
>According to Rafael Garcia-Suarez:
>> As Chip pointed out the current conversion does not do the right thing
>> for upper bytes. IIUC if we convert properly bytes to UTF8 wrt the C
>> locale (unless the user specified otherwise) this problem goes away.
>
>Constants are supposed to be, well, constant; but the lexical locale
>settings and dynamic $LANG settings are not. So upgrading a constant
>is fraught with danger anyway.
>
>Even harsher is the reality that there *is* no legal conversion of
>high-bit characters to UTF-8 in the C locale, because the C locale
>fails to provide such characters any specific meaning.

Which is to say there are no high bit _characters_ in C locale,
only 0x00..0x7f exist.

>
>Conversion of high-bit characters to UTF-8 in the C locale should
>never happen automatically, and when requested by the user they should
>fail. Or so it appears to me.
>
>> Which will revert things so that you can upgrade a string constant from
>> bytes to utf8.
>
>Er, this has to be bad.

Not in an iso-8859-1 or other locale with high bits.
If you know the locale then the transform is legal and does
not change the meaning.

I _think_ I am coming round to a compromise
position that if we are in the C locale (or en_US ?)
then unless string is /^[\x00-\x7f]*$/ then it should complain.
In an 8-bit locale it should convert using the locale...

Regardless of that I would be happy for \w to be C locale one.
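As a rough illustration of that compromise for the C-locale case only: strict_upgrade() below is a hypothetical helper, not a proposed API, and it makes no attempt at the "convert using the locale" branch.

```perl
use strict;
use warnings;

# Hypothetical helper: upgrade a byte string only when it is pure ASCII,
# mimicking "complain in the C locale" for anything with the high bit set.
sub strict_upgrade {
    my ($bytes) = @_;
    die "high-bit bytes have no defined meaning in the C locale\n"
        if $bytes =~ /[^\x00-\x7f]/;
    utf8::upgrade($bytes);      # safe: ASCII is invariant under UTF-8
    return $bytes;
}

my $ok  = strict_upgrade("Quit");
my $err = eval { strict_upgrade("\xD1"); 1 } ? "" : $@;

print "ASCII upgraded: $ok\n";
print "high bit: $err";
```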

h...@crypt.org

Mar 11, 2004, 9:25:55 AM

to perl5-...@perl.org

Nicholas Clark <ni...@ccl4.org> wrote:
:For the moment, should I apply this:

:
:==== //depot/perl/sv.c#725 - /home/nick/p4perl/perl/sv.c ====
:@@ -3477,10 +3477,6 @@
: sv_force_normal_flags(sv, 0);
: }
:
:- if (SvREADONLY(sv)) {
:- Perl_croak(aTHX_ PL_no_modify);
:- }
:-
: if (PL_encoding && !(flags & SV_UTF8_NO_ENCODING))
: sv_recode_to_utf8(sv, PL_encoding);
: else { /* Assume Latin-1/EBCDIC */
:
:Which will revert things so that you can upgrade a string constant from
:bytes to utf8. (which is how things were before I started faffing)

I'd say it is suitable for blead, on the assumption that it _ought_ to
be legitimate, and if it exposes bugs elsewhere we need to fix them.

For maint it is more dubious, because as this discussion shows it _is_
likely to expose bugs elsewhere.

And of course we may conclude that it either is not legitimate, or that
it's the right thing to do but causes too many problems, and revert it. :)

Hugo

Chip Salzenberg

Mar 11, 2004, 9:54:25 AM

to H.Merijn Brand, Perl 5 Porters

According to H.Merijn Brand:

> \w is currently well documented to say what it matches. I don't
> expect to match beyond that, not even in the future (of perl5). If
> - by chance - \w would match á (a-acute), that would probably more
> surprise me than I would expect it.

Surprise!

(1) Since 5.4, Perl has been capable of having \w match things outside
[_0-9a-zA-Z], but you have to say C<use locale> and most people
don't do that. More to the point, C<use locale> can't break
existing code, because it's lexically scoped.

(2) More pressingly, as I demonstrated with my N-with-tilde example,
converting "\xD1" to UTF-8 makes it match /\w/ because the
character classes are interpreted in Unicode style when applied to
UTF-8 strings.

So the question isn't whether \w should be expanded. It *has* been
expanded. The question now is how to limit the damage caused by the
UTF-8 conversions that let it happen in case (2).

Chip Salzenberg

Mar 11, 2004, 9:57:34 AM

to Nick Ing-Simmons, H.Merijn Brand, Perl 5 Porters

According to Nick Ing-Simmons:

> So I tend not to use \w when matching kind of "words" that locale
> affects.

OK, that's good for you, you're less likely to be bitten by the bug.
But that doesn't make the problem go away. But there are people who
use \w for e.g. untainting input to be used in filenames or other
purposes. Are we going to "fix" all that code? All we can fix is Perl.
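To make the untainting worry concrete, here is a sketch comparing a \w-based check with an explicit ASCII class. Both helper names are invented for the example, and default regex semantics are assumed.

```perl
use strict;
use warnings;

# Hypothetical untainting helpers. Both return the captured name or undef.
sub name_via_w     { my ($in) = @_; return $in =~ /\A(\w+)\z/           ? $1 : undef }
sub name_via_ascii { my ($in) = @_; return $in =~ /\A([A-Za-z0-9_]+)\z/ ? $1 : undef }

my $bytes = "a\xD1o";
my $chars = $bytes;
utf8::upgrade($chars);          # same text, stored as UTF-8

for my $case ([bytes => $bytes], [utf8 => $chars]) {
    my ($label, $input) = @$case;
    printf "%-5s \\w: %-8s explicit: %s\n", $label,
        defined name_via_w($input)     ? "accepted" : "rejected",
        defined name_via_ascii($input) ? "accepted" : "rejected";
}
```

The \w check rejects the byte string but accepts the upgraded one; the explicit class rejects both, whatever the storage.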

Nick Ing-Simmons

Mar 11, 2004, 10:09:44 AM

to h...@crypt.org, ni...@ccl4.org, perl5-...@perl.org

<h...@crypt.org> writes:
>Nicholas Clark <ni...@ccl4.org> wrote:
>:For the moment, should I apply this:
>:
>:==== //depot/perl/sv.c#725 - /home/nick/p4perl/perl/sv.c ====
>:@@ -3477,10 +3477,6 @@
>: sv_force_normal_flags(sv, 0);
>: }
>:
>:- if (SvREADONLY(sv)) {
>:- Perl_croak(aTHX_ PL_no_modify);
>:- }
>:-
>: if (PL_encoding && !(flags & SV_UTF8_NO_ENCODING))
>: sv_recode_to_utf8(sv, PL_encoding);
>: else { /* Assume Latin-1/EBCDIC */
>:
>:Which will revert things so that you can upgrade a string constant from
>:bytes to utf8. (which is how things were before I started faffing)

My vote goes to reverting before the faffing, because:

//depot/maint-5.8/perl/...

nick@llama:/home/p4work/perl/maint-5.8> ./perl ~/Tku/tktiny
Modification of a read-only value attempted at /usr/local/perl/lib/site_perl/5.8.3/i686-linux-multi/Tk/Submethods.pm line 37.
nick@llama:/home/p4work/perl/maint-5.8>

Where tktiny is my simplest "hello world" Tk app:

#!/usr/local/bin/perl -w
use strict;
use Tk;
my $mw = MainWindow->new();
$mw->Button(-text => 'Quit', -command => [destroy => $mw])->pack;
MainLoop;
__END__

This barfs as above when Tk asks for UTF-8 of 'Quit'.
Now that happens to be invariant but it still dies.

A nearby hunk of code seems to be the root of the other Tk issue:

if (!SvPOK(sv)) {
STRLEN len = 0;
(void) sv_2pv_flags(sv,&len, flags);
if (!SvPOK(sv))
return len;
}

This behaves oddly if sv is a PVLV (e.g. substr) or other magical
which sets SvPOKp but not SvPOK.
(But I have a work round for that already.)

Rafael Garcia-Suarez

Mar 11, 2004, 10:17:21 AM

to Nick Ing-Simmons, h...@crypt.org, ni...@ccl4.org, perl5-...@perl.org

Quoting Nick Ing-Simmons <nick.ing...@elixent.com>:
> My vote goes to reverting before the faffing, because:

Nicholas already reverted blead.

Change 22483 by nicholas@penfold on 2004/03/10 20:38:49

croaking for readonly SVs in Perl_sv_utf8_upgrade_flags was a mistake
back this out until we have a tangible policy

Affected files ...

... //depot/perl/sv.c#726 edit

Nick Ing-Simmons

Mar 11, 2004, 10:26:04 AM

to ch...@pobox.com, H.Merijn Brand, Perl 5 Porters

Chip Salzenberg <ch...@pobox.com> writes:
>According to H.Merijn Brand:
>> \w is currently well documented to say what it matches. I don't
>> expect to match beyond that, not even in the future (of perl5). If
>> - by chance - \w would match ? (a-acute), that would probably more
>> surprise me than I would expect it.
>
>Surprise!
>
> (1) Since 5.4, Perl has been capable of having \w match things outside
> [_0-9a-zA-Z], but you have to say C<use locale> and most people
> don't do that. More to the point, C<use locale> can't break
> existing code, because it's lexically scoped.

Which is fine.

>
> (2) More pressingly, as I demonstrated with my N-with-tilde example,
> converting "\xD1" to UTF-8 makes it match /\w/ because the
> character classes are interpreted in Unicode style when applied to
> UTF-8 strings.

And I agree the lack of consistency is a bug.

So it shouldn't do that unless you say (lexically scoped) C<use unicode> ?
(C<use ctype 'unicode'> ?)

Back at 5.6 there was a C<use utf8> which had a whole pile of semantics
dumped on it - one of those was this one I guess.
Snag here is how to be binary compatible while adding a flag bit.

IMHO this property is one of current scope, and not a feature of how
the string happens to be represented.

>
>So the question isn't whether \w should be expanded. It *has* been
>expanded.

Not without a pragma?

Chip Salzenberg

Mar 11, 2004, 10:47:41 AM

to Nick Ing-Simmons, H.Merijn Brand, Perl 5 Porters

According to Nick Ing-Simmons:
> Chip Salzenberg <ch...@pobox.com> writes:

> > (2) More pressingly, as I demonstrated with my N-with-tilde example,
> > converting "\xD1" to UTF-8 makes it match /\w/ because the
> > character classes are interpreted in Unicode style when applied to
> > UTF-8 strings.
>
> And I agree the lack of consistency is a bug.

How far back does this go?

> So it shouldn't do that unless you say (lexically scoped) C<use unicode> ?
> (C<use ctype 'unicode'> ?)

That would be safe, sure.

It would be cool to let /\w/ keep its old C-locale meaning and let
/\p{IsWord}/ match UTF-8 \xD1 even without a pragma.

OTOH, it looks like the docs say that /\w/ is a synonym
for both /[[:word:]]/ and /\p{IsWord}/.

OTGH, people who write /\p{IsWord}/ are usually thinking in Unicode
terms, so perhaps it would be OK for the \p{} forms to always match in
the Unicode style, even without a pragma. It'll have to be in large
friendly letters in perldelta, of course....

> IMHO this property is one of current scope, and not a feature of how
> the string happens to be represented.

I think so.

> >So the question isn't whether \w should be expanded. It *has* been
> >expanded.
>
> Not without a pragma?

It has been expanded without a pragma by the automatic application of
Unicode semantics to UTF-8 strings. Which is where we started....

H.Merijn Brand

Mar 11, 2004, 10:52:36 AM

to Chip Salzenberg, Perl 5 Porters

On Thu 11 Mar 2004 16:47, Chip Salzenberg <ch...@pobox.com> wrote:
> According to Nick Ing-Simmons:
> > Chip Salzenberg <ch...@pobox.com> writes:
> > > (2) More pressingly, as I demonstrated with my N-with-tilde example,
> > > converting "\xD1" to UTF-8 makes it match /\w/ because the
> > > character classes are interpreted in Unicode style when applied to
> > > UTF-8 strings.
> >
> > And I agree the lack of consistency is a bug.
>
> How far back does this go?
>
> > So it shouldn't do that unless you say (lexically scoped) C<use unicode> ?
> > (C<use ctype 'unicode'> ?)
>
> That would be safe, sure.
>
> It would be cool to let /\w/ keep its old C-locale meaning and let
> /\p{IsWord}/ match UTF-8 \xD1 even without a pragma.
>
> OTOH, it looks like the docs say that /\w/ is a synonym
> for both /[[:word:]]/ and /\p{IsWord}/.

Changing docs is the easiest fix :]
What does the Camel say?

> OTGH, people who write /\p{IsWord}/ are usually thinking in Unicode
> terms, so perhaps it would be OK for the \p{} forms to always match in
> the Unicode style, even without a pragma. It'll have to be in large
> friendly letters in perldelta, of course....

Same: easy fix. And I agree

> > IMHO this property is one of current scope, and not a feature of how
> > the string happens to be represented.
>
> I think so.
>
> > >So the question isn't whether \w should be expanded. It *has* been
> > >expanded.
> >
> > Not without a pragma?
>
> It has been expanded without a pragma by the automatic application of
> Unicode semantics to UTF-8 strings. Which is where we started....

--

Chip Salzenberg

Mar 11, 2004, 10:54:47 AM

to Nick Ing-Simmons, H.Merijn Brand, Perl 5 Porters

Sorry to follow up to myself, but I just realized that all my talk of
/\w/ might be misunderstood as if that were the only issue.

*All* of the character class regexes (/\d/, /\s/, etc.) and
character-oriented operators (uc, lc, etc.) are involved here if they
change their behavior spontaneously when presented with UTF-8 strings.

If conversions were done in C locale instead of Latin-1 this wouldn't
be so much of an issue. It might even be totally ignorable. I dunno.

Nick Ing-Simmons

Mar 11, 2004, 11:22:18 AM

to ch...@pobox.com, H.Merijn Brand, Nick Ing-Simmons, Perl 5 Porters

Chip Salzenberg <ch...@pobox.com> writes:
>According to Nick Ing-Simmons:
>> So I tend not to use \w when matching kind of "words" that locale
>> affects.
>
>OK, that's good for you, you're less likely to be bitten by the bug.
>But that doesn't make the problem go away.

Agreed.

>But there are people who
>use \w for e.g. untainting input to be used in filenames or other
>purposes. Are we going to "fix" all that code?

Obviously not.

>All we can fix is Perl.

Also agreed. But before we can fix it we need to decide what is correct.
And while perl6 can (and should) set new standards of sane correctness,
correctness for perl5 needs to be judged in the light of what it breaks.

What my comments about \w => [A-Za-z0-9_] were about, was stating that
in the absence of C<use locale> / C<use utf8> / C<use encoding ...>
then [A-Za-z0-9_] is what perl1.0 .. perl5.5 meant, and that rule should be
used even if string happens to be in UTF-8 (and this is easy to
do as ASCII is a subset of UTF-8 - I am ducking the EBCDIC case for
now but NATIVE_IS_INVARIANT should be correct there too).

If C<use locale> is in effect then presumably we should treat any codepoint
which does not back convert to locale's encoding as non-word.
So U+00D1 is NOT word in 8859-2 (where it can't be encoded).
But this is tricky to implement.

C<use encoding> muddles things further - just because the script
is written in 8859-2 doesn't tell us explicitly what central european
author meant \w to mean. However as pragma causes Encode to convert
from named encoding to Unicode then all existing scripts that use
it will have had Unicode semantics of \w (I think).

C<use utf8> gives me a headache, but as I recall where we left it
it is now (roughly?) equivalent to C<use encoding 'utf8'> and so
presumably \w should have Unicode semantics.

In principle orthogonal to all the above is the XS API.
If XS code asks for SvPVutf8 that is what it should get (and as Unicode
is supposedly universal it should never fail ;-)).

The opposite request is slightly more problematic: SvPVbyte has always assumed
U+0000..U+00FF i.e. 8859-1 (or its EBCDIC cousin). We don't have an API
call to get back the locale encoded bytes that may have been "accidentally"
UTF-8 encoded - but explicit use of Encode is possible.
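The asymmetry between the two directions shows up at the Perl level as well. A small sketch using utf8::downgrade with its FAIL_OK argument set, so that it returns false instead of dying:

```perl
use strict;
use warnings;

my $latin = "\xD1";
utf8::upgrade($latin);                       # stored as UTF-8, code point < 256
my $latin_ok = utf8::downgrade($latin, 1) ? 1 : 0;

my $wide = chr(0x600);                       # no single-byte equivalent
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

print "U+00D1 downgraded: $latin_ok\n";   # 1
print "U+0600 downgraded: $wide_ok\n";    # 0
```

Upgrading always succeeds; downgrading only works for code points below 256, and the bytes that come back are Latin-1, not whatever locale encoding the data started in.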

Nick Ing-Simmons

Mar 11, 2004, 11:38:24 AM

to h.m....@hccnet.nl, Chip Salzenberg, Perl 5 Porters

H.Merijn Brand <h.m....@hccnet.nl> writes:
>On Thu 11 Mar 2004 16:47, Chip Salzenberg <ch...@pobox.com> wrote:
>> According to Nick Ing-Simmons:
>> > Chip Salzenberg <ch...@pobox.com> writes:
>> > > (2) More pressingly, as I demonstrated with my N-with-tilde example,
>> > > converting "\xD1" to UTF-8 makes it match /\w/ because the
>> > > character classes are interpreted in Unicode style when applied to
>> > > UTF-8 strings.
>> >
>> > And I agree the lack of consistency is a bug.
>>
>> How far back does this go?

perl5.6 I guess - but that was, in hindsight, a mess.
perl5.8 cleaned up some of the mess, 5.8.1 some more and now it
looks like 5.8.4 will clean up this one?

>>
>> > So it shouldn't do that unless you say (lexically scoped) C<use unicode> ?
>> > (C<use ctype 'unicode'> ?)
>>
>> That would be safe, sure.
>>
>> It would be cool to let /\w/ keep its old C-locale meaning and let
>> /\p{IsWord}/ match UTF-8 \xD1 even without a pragma.

I proposed that (though not in quite those words) the other day i.e.
\w (and all related thingies) keep their C (or C<use locale>?) meanings.
If you want Unicode semantics then use \p{}.

BUT - IMHO the semantics should be independent of how the string is
represented. If we can get to that point, then perl core (or XS code)
is free to flip back and forth for (space/time) efficiency reasons.
If we can't agree this then we need to re-think SvPVutf8 API.
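For comparison, a representation-independent split of this kind can be sketched with features that did not exist when this was written: the /a regex modifier (Perl 5.14 and later) pins \w to ASCII, and a pattern containing \p{...} gets Unicode rules. \p{Word} is used here in place of the \p{IsWord} spelling from the thread, and classify() is a made-up helper.

```perl
use strict;
use warnings;

# Needs Perl 5.14 or later for the /a modifier.
sub classify {
    my ($s) = @_;
    return sprintf "ascii_w=%d unicode_p=%d",
        $s =~ /\w/a      ? 1 : 0,
        $s =~ /\p{Word}/ ? 1 : 0;
}

my $bytes = "\xD1";
my $chars = $bytes;
utf8::upgrade($chars);          # same character, stored as UTF-8

my $for_bytes = classify($bytes);
my $for_chars = classify($chars);

print "bytes: $for_bytes\n";    # ascii_w=0 unicode_p=1
print "utf8:  $for_chars\n";    # ascii_w=0 unicode_p=1
```

Both answers are the same for either storage, so the core would be free to flip representations without changing what a match means.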

>>
>> OTOH, it looks like the docs say that /\w/ is a synonym
>> for both /[[:word:]]/ and /\p{IsWord}/.
>
>Changing docs is the easiest fix :]

Well we need to change code (a little) too - current code is broken.

>What does the Camel say?

Will try and find a copy ... it probably contradicts itself ;-)

>
>> OTGH, people who write /\p{IsWord}/ are usually thinking in Unicode
>> terms, so perhaps it would be OK for the \p{} forms to always match in
>> the Unicode style, even without a pragma. It'll have to be in large
>> friendly letters in perldelta, of course....
>
>Same: easy fix. And I agree
>
>> > IMHO this property is one of current scope, and not a feature of how
>> > the string happens to be represented.
>>
>> I think so.
>>
>> > >So the question isn't whether \w should be expanded. It *has* been
>> > >expanded.
>> >
>> > Not without a pragma?
>>
>> It has been expanded without a pragma by the automatic application of
>> Unicode semantics to UTF-8 strings. Which is where we started....

Which surely we all agree is a bug?

Yitzchak Scott-Thoennes

Mar 11, 2004, 12:17:30 PM

to Nick Ing-Simmons, h.m....@hccnet.nl, Chip Salzenberg, Perl 5 Porters, j...@iki.fi

No. Once something is reencoded into utf8, they are unicode characters.
The current scope's locale/encoding no longer applies. Breaking that
would be worse than how it is now. Even providing a pragma to decide
how utf8-encoded code points less than 256 are treated leads to
horrible inconsistency.

Jarkko worked very hard to get things to where they are now; please don't
jump in too hastily to fix things, especially without his input.

I do agree that the upgrade changing \w et al is a bug, but I'm not
sure it's fixable. I have a lot of trouble believing the argument
that detainting could become less effective.

Seems to me we might have a (suppressable) warning at the time of
upgrade if there are characters whose classes would change.

Chip Salzenberg

Mar 11, 2004, 12:44:29 PM

to Yitzchak Scott-Thoennes, Nick Ing-Simmons, h.m....@hccnet.nl, Perl 5 Porters, j...@iki.fi

According to Yitzchak Scott-Thoennes:

> Once something is reencoded into utf8, they are unicode characters.
> The current scope's locale/encoding no longer applies. Breaking that
> would be worse than how it is now.

That's a reasonable position, I think.

> I do agree that the upgrade changing \w et al is a bug, but I'm not
> sure it's fixable.

I suppose the first step is determining what should happen, precisely,
when bytes meet UTF-8 (either through combining or through built-in
operations that imply the need for UTF-8). Then we can see whether
making that behavior happen is feasible.

> I have a lot of trouble believing the argument that detainting could
> become less effective.

Used tainting much?

Nick Ing-Simmons

Mar 11, 2004, 12:45:35 PM

to nick.ing...@elixent.com, Chip Salzenberg, h.m....@hccnet.nl, Perl 5 Porters

Nick Ing-Simmons <nick.ing...@elixent.com> writes:
>H.Merijn Brand <h.m....@hccnet.nl> writes:
>>On Thu 11 Mar 2004 16:47, Chip Salzenberg <ch...@pobox.com> wrote:
>>>
>>> It would be cool to let /\w/ keep its old C-locale meaning and let
>>> /\p{IsWord}/ match UTF-8 \xD1 even without a pragma.
>>>
>>> OTOH, it looks like the docs say that /\w/ is a synonym
>>> for both /[[:word:]]/ and /\p{IsWord}/.
>>
>>Changing docs is the easiest fix :]
>
>Well we need to change code (a little) too - current code is broken.
>
>>What does the Camel say?

Camel 3rd Edition:

p167 It talks about \w "As Bytes" and "As utf8"
"To keep the old byte meanings you can always C<use bytes>".
(Note that C<use bytes> once upon a time had horrible side effects.)

As Bytes has \w as [a-zA-Z0-9_]
As utf8 has \w as \p{IsWord}
- as of 5.6.0 you need to C<use utf8> for these properties to work.
This restriction will be relaxed in future.

\b is defined in terms of \w\W

But on the other hand p409 says "Caution Men working" explicitly
says that:
- regular expression code isn't polymorphic and needs changing
- use of locales with utf8 may lead to odd results.

Now in 5.8 they are polymorphic but have the bug we are discussing.

Nick Ing-Simmons

Mar 11, 2004, 12:47:40 PM

to rgarci...@free.fr, ni...@ccl4.org, Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org

Rafael Garcia-Suarez <rgarci...@free.fr> writes:
>Quoting Nick Ing-Simmons <nick.ing...@elixent.com>:
>> My vote goes to reverting before the faffing, because:
>
>Nicholas already reverted blead.
>
>Change 22483 by nicholas@penfold on 2004/03/10 20:38:49
>
> croaking for readonly SVs in Perl_sv_utf8_upgrade_flags was a mistake
> back this out until we have a tangible policy

But 'tis //depot/maint-5.8/perl/sv.c (#90)

that kills Tk.

Nick Ing-Simmons

Mar 11, 2004, 1:08:32 PM

to stho...@efn.org, Chip Salzenberg, j...@iki.fi, h.m....@hccnet.nl, Nick Ing-Simmons, Perl 5 Porters

Yitzchak Scott-Thoennes <stho...@efn.org> writes:
>> >> It has been expanded without a pragma by the automatic application of
>> >> Unicode semantics to UTF-8 strings. Which is where we started....
>>
>> Which surely we all agree is a bug?
>
>No. Once something is reencoded into utf8, they are unicode characters.

Even if you didn't ask for it?

{use locale;

$string = "\xff"; # not a valid word char in ascii or locale

# call out into somewhere else e.g.
$tktext->append($string) or send $utf8_fh $string;
# because some other code has touched it $string is now SvUTF8,
# because locale is lexical scoped 0xff is now U+00FF as
# other place defaulted to Latin1 when upgrade happened

if ($string =~ /^\w+$/)
{
# so we get here
}

}
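A runnable reduction of the scenario above, as a sketch only: `utf8::upgrade` stands in for whatever foreign code (Tk, a UTF-8 filehandle) touches the string, and `use locale` is left out so that the result depends on nothing but the internal representation. It assumes a perl whose default regex semantics follow the UTF-8 flag, with no `unicode_strings` feature and no `/a`, `/u` or `/l` modifier in effect.

```perl
use strict;
use warnings;

my $string = "\xff";                        # one byte, UTF-8 flag off

# Byte semantics: 0xFF is not a word character.
my $before = $string =~ /^\w+$/ ? 1 : 0;

# Stand-in for the call into somewhere else; the characters are
# unchanged, only the internal representation is.
utf8::upgrade($string);

# Unicode semantics: U+00FF is a letter, so \w now matches.
my $after = $string =~ /^\w+$/ ? 1 : 0;

printf "matched before upgrade: %d, after upgrade: %d\n", $before, $after;
```

The same pattern against the same characters gives a different answer once the string has been upgraded behind the caller's back.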

>The current scope's locale/encoding no longer applies. Breaking that
>would be worse than how it is now.

>Even providing a pragma to decide
>how utf8-encoded code points less than 256 are treated leads to
>horrible inconsistency.

So does current behaviour.

>
>Jarkko worked very hard to get things to where they are now; please don't
>jump in too hastily to fix things, especially without his input.

Fine.

>
>I do agree that the upgrade changing \w et al is a bug, but I'm not
>sure it's fixable. I have a lot of trouble believing the argument
>that detainted could become less effective.
>
>Seems to me we might have a (suppressable) warning at the time of
>upgrade if there are characters whose classes would change.

The snag is they might not change at the point of upgrade.
If Tk/Text.pm (no locale or encoding in scope) does the upgrade then
what do we check against? Suppose string is 'Quit' - nothing odd
there. Now append \xff - string is Unicode now so that appends U+00FF.
Return to the caller's code and we now have a surprising wordy 'ÿ'
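To make the timing problem concrete, here is a small sketch. The "check at upgrade time" is a hypothetical stand-in for the proposed warning (it simply looks for characters in 0x80..0xFF), and `utf8::upgrade` again plays the part of Tk/Text.pm. The same assumption applies: default regex semantics, no `unicode_strings`.

```perl
use strict;
use warnings;

my $label = 'Quit';

# Hypothetical upgrade-time check: would any character change class?
# Only 0x80..0xFF can, and 'Quit' has none, so nothing would be warned about.
my $suspect_at_upgrade = $label =~ /[\x80-\xff]/ ? 1 : 0;

utf8::upgrade($label);      # the upgrade itself is harmless here

$label .= "\xff";           # appended later; the result stays upgraded

# The appended character is now U+00FF and counts as a word character.
my $wordy = $label =~ /^\w+$/ ? 1 : 0;

# The same characters built without any upgrade do not match.
my $bytes_only = ('Quit' . "\xff") =~ /^\w+$/ ? 1 : 0;

printf "suspect at upgrade: %d, upgraded matches: %d, bytes match: %d\n",
    $suspect_at_upgrade, $wordy, $bytes_only;
```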

Nick Ing-Simmons

Mar 12, 2004, 3:21:24 AM

to ni...@ccl4.org, h...@crypt.org, perl5-...@perl.org, rgarci...@free.fr

Nick Ing-Simmons <nick.ing...@elixent.com> writes:
>Rafael Garcia-Suarez <rgarci...@free.fr> writes:
>>Quoting Nick Ing-Simmons <nick.ing...@elixent.com>:
>>> My vote goes to reverting before the faffing, because:
>>
>>Nicholas already reverted blead.
>>
>>Change 22483 by nicholas@penfold on 2004/03/10 20:38:49
>>
>> croaking for readonly SVs in Perl_sv_utf8_upgrade_flags was a mistake
>> back this out until we have a tangible policy
>
>But 'tis //depot/maint-5.8/perl/sv.c (#90)
>

//depot/maint-5.8/perl/...

runs Tk804.025_beta16+ again now - thanks.

Nicholas Clark

Mar 12, 2004, 4:33:46 PM

to H.Merijn Brand, Chip Salzenberg, Perl 5 Porters

On Wed, Mar 10, 2004 at 07:34:16PM +0100, H.Merijn Brand wrote:
> On Wed 10 Mar 2004 18:00, Chip Salzenberg <ch...@pobox.com> wrote:
> > > >It gets worse if there are *any* circumstances in which $nn might be
> > > >converted to UTF-8 _in-place_. Please tell me there aren't.
> > >
> > > This is the current snag - there are spots in core perl which do that.
> >
> > <a href="chip_screaming.mp3>Great.</a>
>
> s/(?<=3)/\N{QUOTATION MARK}/;
>
> Who is sorry for opening this can of 0x80 worms
>
> I, using Unicode on a regular basis, only in the UTF-8 encoding, expect
> 0x20..0xFE to be iso-8859-1 and everything else to be UTF-8. I know that this
> is pretty shortsighted, but it keeps me alive and healthy.
>
> Just my € 0.02

Mmm. 0x80. An 0x80 worm. That's not a valid printable character in ISO-8859-1

(and your headers agree with the comment in the message body:
Content-Type: text/plain; charset="ISO-8859-1"
)

That was somewhat pointless MIME pedantry. However, the important part is
that conversion from bytes to Unicode is impossible without locale.
And I know you've told me before which extension to ISO-8859-1 maps the
euro symbol to 0x80, but it does mean that I need to remember that before
I can convert your message properly.

Which I feel proves Chip's point about implicit from-byte conversions.
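A short illustration of why the byte alone is not enough, using the core Encode module. The two encoding names are chosen for the example (cp1252 being one extension that puts the euro sign at 0x80); nothing here is taken from the message headers.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octet = "\x80";
my %code_point;

for my $encoding ('iso-8859-1', 'cp1252') {
    my $copy = $octet;                      # decode a copy, leave the source alone
    $code_point{$encoding} = ord decode($encoding, $copy);
    printf "%-10s 0x80 -> U+%04X\n", $encoding, $code_point{$encoding};
}
```

Read as ISO-8859-1 the byte becomes the control character U+0080; read as cp1252 it becomes U+20AC, the euro sign. The reader has to know which was meant.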

Nicholas Clark

Mar 12, 2004, 5:25:52 PM

to Nick Ing-Simmons, h...@crypt.org, perl5-...@perl.org, rgarci...@free.fr

Sorry about the delay - I didn't remember that I hadn't rolled everything
over to maint, and then once I did it last night I got diverted onto
mouse catching, and then didn't have time to deal with this thread.
(or more accurately "attempted mouse catching" - my success in the
kitchen a couple of weeks ago was not repeated last night. Maybe I am not
a cat substitute)

Nicholas Clark

Chip Salzenberg

Mar 12, 2004, 5:41:55 PM

to Nick Ing-Simmons, Nicholas Clark, perl5-...@perl.org, Rafael Garcia-Suarez

According to Nick Ing-Simmons:
> Chip Salzenberg <ch...@pobox.com> writes:

> >Even harsher is the reality that there *is* no legal conversion of
> >high-bit characters to UTF-8 in the C locale, because the C locale
> >fails to provide such characters any specific meaning.
>
> Which is to say there are no high bit _characters_ in C locale,
> only 0x00..0x7f exist.

That's better; it works for both translations (C <=> Unicode).

> >> Which will revert things so that you can upgrade a string constant from
> >> bytes to utf8.
> >
> >Er, this has to be bad.
>
> Not in an iso-8859-1 or other locale with high bits. If you know
> the locale then the transform is legal and does not change the
> meaning.

But the upgrade persists and becomes visible outside the scope of the
locale setting. (Once we *have* a locale setting that affects UTF-8
translations, which we don't yet. (Do we?))

> I _think_ I am coming round to a compromise position that if we are
> in the C locale (or en_US ?) then unless string is /^[\x00-\x7f]*$/
> then it should complain. In an 8-bit locale it should convert using
> the locale...

I agree with this, except for leakage problems if the results of the
conversion escape the scope(s) of the relevant pragma(s).
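The compromise could be sketched roughly as follows. `checked_upgrade` is a hypothetical helper, not a core API, and passing an encoding name explicitly is an assumption standing in for "the locale's character set". The sketch does nothing about the leakage problem: the result is an ordinary upgraded string once it leaves the helper.

```perl
use strict;
use warnings;
use Carp qw(croak);
use Encode qw(decode);

# Hypothetical helper: upgrade a byte string only when the conversion
# is well defined.
sub checked_upgrade {
    my ($bytes, $encoding) = @_;            # $bytes is a private copy

    # 8-bit locale case: convert using the named character set.
    return decode($encoding, $bytes) if defined $encoding;

    # C locale case: only 0x00..0x7f exist, so anything else is an error.
    croak "high-bit byte has no meaning in the C locale"
        if $bytes =~ /[^\x00-\x7f]/;

    utf8::upgrade($bytes);
    return $bytes;
}

my $plain = checked_upgrade('Quit');                    # fine
my $euro  = checked_upgrade("\xa4", 'iso-8859-15');     # 0xA4 is the euro sign there
my $error = eval { checked_upgrade("\xff"); 1 } ? '' : $@;

printf "plain: %s, euro: U+%04X, error raised: %s\n",
    $plain, ord $euro, ($error ? 'yes' : 'no');
```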

> Regardless of that I would be happy for \w to be C locale one.

I perceive that this is a question we can resolve separately.

Nick Ing-Simmons

Mar 15, 2004, 6:39:37 AM

to ch...@pobox.com, Nicholas Clark, Nick Ing-Simmons, Rafael Garcia-Suarez, perl5-...@perl.org

Chip Salzenberg <ch...@pobox.com> writes:
>According to Nick Ing-Simmons:
>> Chip Salzenberg <ch...@pobox.com> writes:
>> >Even harsher is the reality that there *is* no legal conversion of
>> >high-bit characters to UTF-8 in the C locale, because the C locale
>> >fails to provide such characters any specific meaning.
>>
>> Which is to say there are no high bit _characters_ in C locale,
>> only 0x00..0x7f exist.
>
>That's better; it works for both translations (C <=> Unicode).
>
>> >> Which will revert things so that you can upgrade a string constant from
>> >> bytes to utf8.
>> >
>> >Er, this has to be bad.
>>
>> Not in an iso-8859-1 or other locale with high bits. If you know
>> the locale then the transform is legal and does not change the
>> meaning.
>
>But the upgrade persists and becomes visible outside the scope of the
>locale setting. (Once we *have* a locale setting that affects UTF-8
>translations, which we don't yet. (Do we?))

From an internals technical point of view I believe that
the upgraded-ness of a string should have no semantic value.
Thus whether the N-th thing in string (say chr(0xff)) is
represented as one byte or the two byte UTF-8 encoding should
not affect that ord(substr($s,N,1)) == 255.
Nor IMHO should it affect whether in a particular scope it matches the
/\w/ meaning in that scope.
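The part of this that already holds can be checked from Perl space. A minimal sketch, with `bytes::length` used only to peek at the internal storage:

```perl
use strict;
use warnings;
use bytes ();       # load bytes::length without turning on byte semantics

my $plain    = "ab\xff";
my $upgraded = $plain;
utf8::upgrade($upgraded);

# Character-level view: identical whichever representation is used.
my $same_ord = (ord(substr($plain, 2, 1)) == 255
             && ord(substr($upgraded, 2, 1)) == 255) ? 1 : 0;
my $same_len = (length($plain) == length($upgraded)) ? 1 : 0;
my $equal    = ($plain eq $upgraded) ? 1 : 0;

# Storage-level view: chr(0xff) takes one byte in one, two in the other.
my @storage = (bytes::length($plain), bytes::length($upgraded));

printf "ord same: %d, length same: %d, eq: %d, stored bytes: %d vs %d\n",
    $same_ord, $same_len, $equal, @storage;
```

`ord`, `length` and `eq` do not see the representation; the complaint in this thread is that `/\w/` does.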

However it seems it does have a semantic value at present.

Which means that internals must avoid upgrading which means
a lot of malloc-new-string, utf-8 encode new string,
process in unicode, free new string. In particular as long
as we have this semantic attached to the representation
then SvPVutf8 is problematic - presumably it should malloc
a new value encode to it and return that (and add
a scope-exit SAVE_XXX() to free it again). But even that
is going to surprise things like Tk which (since perl5.6) have
been using the call, and expecting the string to have the same lifetime
as the SV.

Nicholas Clark

Mar 15, 2004, 6:49:46 AM

to Nick Ing-Simmons, ch...@pobox.com, Rafael Garcia-Suarez, perl5-...@perl.org

On Mon, Mar 15, 2004 at 11:39:37AM +0000, Nick Ing-Simmons wrote:

> Which means that internals must avoid upgrading which means
> a lot of malloc-new-string, utf-8 encode new string,
> process in unicode, free new string. In particular as long
> as we have this semantic attached to the representation
> then SvPVutf8 is problematic - presumably it should malloc
> a new value encode to it and return that (and add
> a scope-exit SAVE_XXX() to free it again). But even that
> is going to surprise things like Tk which (since perl5.6) have
> been using the call, and expecting the string to have the same lifetime
> as the SV.

If the conversion is cached (somehow) with the SV, and the cached conversion
only discarded when the SV is changed (Bluurg. Worry about SvPV later),
would this make Tk happy? [i.e. would it meet Tk's assumptions]
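As a Perl-level analogy only (the real thing would live in C, inside the SV), the caching idea might look like this. The `CachedUtf8` class and its methods are invented for illustration, and the byte value is assumed to be Latin-1 when encoded.

```perl
use strict;
use warnings;

package CachedUtf8;

sub new {
    my ($class, $value) = @_;
    return bless { value => $value, cache => undef, encodes => 0 }, $class;
}

# Changing the value discards the cached conversion.
sub set {
    my ($self, $value) = @_;
    $self->{value} = $value;
    $self->{cache} = undef;
    return;
}

# Return the UTF-8 octets, encoding at most once per value.
# The stored value itself is never upgraded.
sub utf8_bytes {
    my ($self) = @_;
    unless (defined $self->{cache}) {
        my $octets = $self->{value};        # work on a copy
        utf8::encode($octets);
        $self->{cache} = $octets;
        $self->{encodes}++;
    }
    return $self->{cache};
}

package main;

my $sv     = CachedUtf8->new("\xff");
my $first  = $sv->utf8_bytes;               # encodes
my $second = $sv->utf8_bytes;               # served from the cache
my $encodes_before_set = $sv->{encodes};

$sv->set('Quit');                           # invalidates the cache
my $third = $sv->utf8_bytes;                # encodes again

printf "encodes before set: %d, after: %d\n",
    $encodes_before_set, $sv->{encodes};
```

The cached octets live as long as the value is unchanged, which is the lifetime a caller in Tk's position would be relying on.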

Nicholas Clark

Nick Ing-Simmons

Mar 15, 2004, 7:42:59 AM

to ni...@ccl4.org, ch...@pobox.com, Nick Ing-Simmons, perl5-...@perl.org, Rafael Garcia-Suarez

That would fix the perl -> Tk direction just fine.
And other things that want UTF-8 for network etc. would also
be happy with that.

But as GUIs pass data both ways it won't solve all Tk problems.

Now most results come back via xsub returns so SVs are Tk's and
will usually/often/sometimes be SvUTF8_on.
But in some cases like

$entry->configure(-textvariable => \$price);

Tk expects its changes to be visible to user, so there is 'U' magic
that keeps them in step. That can, I guess, keep it downgraded
unless essential - but that is extra re-re-scanning of the
string which the current scheme of "upgrade and forget" avoids.

>
>Nicholas Clark