My comments are as follows.
There is no way to get it "all" to work. There never was. "All" is a
big word.
One has to consider backward compatibility (ASCII, "binary", locales)
and all the various aspects of Unicode. The best guiding light I can
give is to always think of I/O (the Perl scripts being a form of I): how
did this data get in, what data do we think it is, and how is the data
going to go out?
A good example of unexpected breakage was the 5.8.0 semantics of locales
possibly turning on UTF-8-ness on the STD* streams: a good idea, but the
result was horrible breakage on Red Hat systems, because they shipped
UTF-8 locales for everybody, and their print chr(0xff) started producing
two bytes, not one.
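(A minimal sketch of that failure mode, from memory -- the UTF-8 locale
made STDOUT implicitly :utf8, so the single byte came out as its
two-byte UTF-8 encoding:)

    % env LC_ALL=en_US.UTF-8 perl -e 'print chr(0xff)' | od -An -tx1
     c3 bf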
After some careful thought and consideration I must say I can't help
taking it somewhat personally that people are now coming out of the
woodwork and finding things "wrong". I have lately been accused both
privately and publicly, both directly and indirectly, both for keeping
backward compatibility and for not keeping it.
I know what we have has problems, but I am done with it. You can do
whatever you want, change the Unicode semantics and interfaces however
you want. I gave it my best shot.
P.S. Sorry about the format=flawed in the message headers. I've had a
bad email week, on both MUA and MTA fronts.
--
Jarkko Hietaniemi <j...@iki.fi> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'. It is 'dead'." -- Jack Cohen
I appreciate and value what you've done with Perl, Jarkko, and so do
all the sane Perl people. You know that. The Unicode discussion is
not personal. And it's just wrong to suggest that anyone who wasn't
participating at the time of original discussion is not allowed to
complain or get involved in changing it.
I've been in your seat, you know. Most new things in 5.004 were
criticized. Sometimes the critics were right. Whenever I took it
personally, I was out of line, and no one benefitted.
Should I have taken personally the (deserved) jabs at C<*GLOB{FOO}>
and the complaints about the warning for C<< while ($a = <>) {} >>?
Should I have been upset that the latter issue was eliminated by Larry
"coming out of the woodwork" and redefining the language so that all
my careful compatibility efforts were moot? Of course not. Now it's
better than it was, and that's what counts.
So don't take it personally. We're discussing code, not coders.
--
Chip Salzenberg - a.k.a. - <ch...@pobox.com>
"I wanted to play hopscotch with the impenetrable mystery of existence,
but he stepped in a wormhole and had to go in early." // MST3K
I'm pretty well aware of this. ("All" is a big word but not a long one.
It can be said quickly.)
> One has to consider backward compatibility (ASCII, "binary", locales)
> and all the various aspects of Unicode.
That's why I wanted to keep the old behaviour in 5.9.1, and take time to
think about it. (And I'm still slowly learning about locales and
Unicode.) In short, for language design or redesign, it's next door (the
one with a big "6" on it).
> The best guiding light I can
> give is to always think of I/O (the Perl scripts being a form of I): how
> did this data get in, what data do we think it is, and how is the data
> going to go out?
>
> A good example of unexpected breakage was the 5.8.0 semantics of locales
> possibly turning on UTF-8-ness on STD* streams: a good idea, but the
> result was horrible breakage in Redhat systems because they did have
> UTF-8 locales for everybody, and their print chr(0xff) started producing
> two bytes, not one.
And finally the new meaning of the -C switch turned out to be useful.
> After some careful thought and consideration I must say I can't help
> taking it somewhat personally that people are now coming out of the
> woodwork and finding things "wrong". I have lately been accused both
> privately and publicly, both directly and indirectly, both for keeping
> backward compatibility and for not keeping it.
The current situation is not that bad. Lots of people use perl 5.8 and
it doesn't explode at every locale corner. Hey, I've even been told
there are some Japanese and Taiwanese people who aren't disgusted by it yet.
=for p5p
The problem that was discussed at length was with the upgrade of byte
strings that contain high-bit chars to UTF-8 strings under the C locale.
I'm coming to think (and I don't think Hugo disagrees) that this is
mostly unfixable. Maybe a new kind of stricture could disallow this
operation, be it a lexical stricture, or a flag on 'strict' byte strings
(binary strings). Or both. Or none. Some technical discussions are
needed.
=cut
> I know what we have has problems, but I am done with it. You can do
> whatever you want, change the Unicode semantics and interfaces however
> you want. I gave it my best shot.
You did a great job.
> According to Jarkko Hietaniemi:
>
>>After some careful thought and consideration I must say I can't help
>>taking it somewhat personally that people are now coming out of the
>>woodwork and finding things "wrong".
We must stop seeing each other like this, Chip.
> And it's just wrong to suggest that anyone who wasn't
> participating at the time of original discussion is not allowed to
> complain or get involved in changing it.
While I agree in principle I am sorry but my throat is still hoarse from
talking these things over and over again.
> So don't take it personally. We're discussing code, not coders.
I don't think it's the case here, really, and that's not why you pushed
my wrong buttons. Discussing _code_ would be easy. We are discussing
behaviour here, and I can tell you the overall behaviour is from hell.
It's not about Perl features. It's a morass of conflicting and
under-specified real-world behaviours and the confusing and confused
expectations people have.
One devious move would be to _ignore_ locales regarding \w et alia for
the 0x80..0xff range and use the Unicode semantics for that whole range.
That way the behaviour would be the same everywhere [1]. (For
performance reasons the \p{Alpha} code paths should be avoided; instead,
simple bitmaps should be prepared. Of course, this approach could be
extended to the whole 0x00..0xff range.)
[1] And we would avoid stupidities like e.g. e-grave not being a letter
in a locale (language_country) which doesn't happen to have e-grave in
its 'native' letters.
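(For illustration, under that proposal the one-liner below would print
"word" in every locale, because 0xE8 -- e-grave in Latin-1 -- would take
its \w-ness from Unicode rather than from the current locale. This is a
sketch of the proposed behaviour, not of what any released perl does.)

    % perl -e 'print chr(0xE8) =~ /\w/ ? "word" : "not word"'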
So the real hell is devising the spec, and that's all about people?
Sure. But that's part of "discussing the code", because coding starts
with the spec of what the code is supposed to do. When I said "code
not coders", I meant we're saying "this is what should happen" and not
"he should be the first one up againt the wall".
[Spec discussion begins here; stop reading if you Really Don't Care]
AFAICT, we've only got two technical issues here. They're both about
making sure that Perl doesn't assume a locale when the user hasn't
told it to. Perl's history establishes beyond doubt that Perl works
in the "C" locale until a switch or pragma says otherwise. I see no
reason why the Unicode support should be an exception.
1. It's obvious that upgrading strings in place without explicit
request is a Bad Thing, and I suspect we can get rid of that
without breaking much (if anything) at the user code level.
2. It's less obvious whether the assumption that all byte strings are
8859-1 for purposes of UTF-8 conversion can be fixed. Still, I
think it's worth trying. Maybe we can reduce the collateral damage
enough to make it OK.
Revolution !!
I am afraid I really have to bow out of this rehashing. I'm starting to
have bad trip flashbacks.
If people have specific questions of the sort "why was X done this way?"
or "what will break if Y is changed" (for the latter the answer should
be easy: just run "make test", and "make utest"), I can answer those
questions privately, but I must fight being dragged back to the time
vortex that is p5p. Sorry.
I really think we need to assume locale "C", not Latin-1, which is
what your suggestion amounts to.
After all, \w isn't just for humans (any more?), it's also for making
code work right. For example, [_[:alpha:]]\w* is a program
identifier, \d+ is a pid, and \s is [ \t\r\n]. And we use patterns
for untainting unsafe data that arrive in CGI query strings....
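(A minimal untainting sketch, with made-up variable names, of why the
character classes are security-sensitive: the capture is what gets
untainted, so \d had better mean what the code thinks it means.)

    my ($pid) = $query_value =~ /\A(\d+)\z/
        or die "not a pid";
    kill 'TERM', $pid;    # allowed under -T only because \d+ vetted it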
> According to Jarkko Hietaniemi:
>
>>One devious move would be to _ignore_ locales regarding \w et alia for
>>the 0x80..0xff range and use the Unicode semantics for that whole range.
>
>
> I really think we need to assume locale "C", not Latin-1, which is
> what your suggestion amounts to.
People.
> After all, \w isn't just for humans (any more?), it's also for making
> code work right. For example, [_[:alpha:]]\w* is a program
> identifier,
For you a program identifier is that (I assume you mean
[:alpha:] in the POSIX locale sense). For someone else it is
<id> := <id_start> (<id_start> | <id_extend>)*
<id_start> := General Category = L or Nl, or Other_ID_Start
<id_extend> := General Category = Mn, Mc, Nd, Pc, or Cf
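(Roughly, in Perl property escapes -- General Categories only, with the
Other_ID_* additions left out -- that definition reads:)

    my $id_re = qr/^ [\p{L}\p{Nl}]
                     [\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]* $/x;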
In my experience, most of the documents that aren't pure 7bit ASCII
are actually windows cp1252 (a superset of iso-8859-1), courtesy of
all those d*mn MS*ft programs out there. If anything would need to
change, I think Perl should assume cp1252 rather than iso-8859-1.
Sometimes files are even marked as iso-8859-1 while in fact they're
cp1252 (again, probably courtesy of MS*ft programs or otherwise
clueless users).
But I don't think a change is possible anymore: Perl's 5.8.X
behaviour is already deeply entrenched in many applications.
Changing it now would cause all sorts of subtle errors in existing
applications.
>After all, \w isn't just for humans (any more?), it's also for making
>code work right. For example, [_[:alpha:]]\w* is a program
>identifier, \d+ is a pid, and \s is [ \t\r\n]. And we use patterns
>for untainting unsafe data that arrive in CGI query strings....
It would seem to me that Chip's problem is with (un)tainting mainly.
Maybe all of this could be "fixed" by adding warnings in big, bold
letters to the documentation.
And another thing: why do you need to do a regex to untaint an SV?
Why isn't there an "untaint" function in e.g. Scalar::Util?
I don't think even Perl 6 will be able to do the Right Thing (tm)
with regards to interpreting text of unknown encoding, at least not
until everybody does Unicode in the world. There's just too much
unlabelled and wrongly labelled stuff out there.
I think the current situation should remain with good and clear
documentation about the issues Chip has raised.
I think Jarkko created a good compromise in 5.8.0, which he improved
in 5.8.1. And he definitely doesn't deserve being drawn back into
the vortex. I think Jarkko is an example to all Perl programmers,
from newbies to pumpkings. I can only wish I had the stamina and wit
with which Jarkko was able to deliver 5.8.0 and 5.8.1.
Liz
Yeah, "program identifier" was a poor example. Let's go back to
basics and see where our understandings diverge:
* \w and all the other character class definitions have been
well defined and unchanging since the beginning of Perl time.
> Exceptions have always been marked with pragmas, command options,
UTF-8 strings, or some other marker that indicates "new rules".
* Character classes are used for untainting.
* Untainting is a security feature.
* Breaking security is the worst sin we can commit.
* If we change character classes to not assume "C" locale for byte
strings, we are breaking security.
* Ergo, in the absence of pragmas or other markers, we must assume
"C" locale and ancestral character class definitions for byte
strings or we're breaking security and thus being _very_ evil.
So that's the \w argument. The conversion argument is Something Else.
I should add that the \w argument becomes a lot less urgent if the
conversion argument can be settled in the direction of safety rather
than convenience.
> > Exceptions [to "C" locale] have always been marked with pragmas,
> > command options, UTF-8 strings, or some other marker ...
Please strike "UTF-8 strings" there. UTF-8 strings are not markers,
because they invisibly float into code that often wasn't
designed for them. You can't see them, they look like normal strings
... at first! ("This week on 'The Invaders'.")
So: Changes in language rules require lexical markers.
(serves me right for hurrying)
Since the filter that should have moved this thread to trash apparently
isn't quite working, I'm still here...
> * \w and all the other character class definitions have been
> well defined and unchanging since the beginning of Perl time.
>
> > Exceptions have always been marked with pragmas, command options,
> UTF-8 strings, or some other marker that indicates "new rules".
See below.
> * Character classes are used for untainting.
>
> * Untainting is a security feature.
>
> * Breaking security is the worst sin we can commit.
>
> * If we change character classes to not assume "C" locale for byte
> strings, we are breaking security.
I agree about the importance of security, but I think here you are
overreacting. This is what *I* hear you saying:
- if people mix Unicode data and eight-bit data of unknown character
set, the security implications are unknown because the eight-bit data
is interpreted as Latin-1.
And I go: "Huh? What the <bleep> did you expect? You didn't specify
what you meant by the \xHH, so Perl had to assume _something_ when
it joined that byte with the past-0xFF \x{HHH}."
I chose / we chose the "something" to be "silently assume that those
bytes were Latin-1-ish". I can follow the argument that Perhaps instead
we should croak / warn and e.g. ask people to put an explicit "use
encoding" at the top. That could be enabled by -T and -t.
If your data touches Unicode, IT IS NO MORE THE SAME DATA. (You can
think of Unicode as another kind of taint, if that helps :-)
To put it the other way, if people are using Unicode, they should be
aware that the meanings of \w et alia do change. And yes, I strongly
believe that \w, [[:alpha:]], and \p{Letter} are the same.
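(A small sketch of what "the same" means here: for a string with Unicode
semantics all three classifications agree on a letter such as U+00E9.
Exact behaviour may vary with locale settings and perl version.)

    my $c = "\x{E9}";
    print "letter by all three\n"
        if $c =~ /\w/ and $c =~ /[[:alpha:]]/ and $c =~ /\p{Letter}/;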
> * Ergo, in the absence of pragmas or other markers, we must assume
> "C" locale and ancestral character class definitions for byte
> strings or we're breaking security and thus being _very_ evil.
Think of Unicode as the "other marker". You had to open a stream in Unicode
mode / use the encoding pragma / use the utf8 pragma / use the -C switch
/ use the PERL_UNICODE env var / use the \x{HHH} / use the chr(0xHHH) /
use the \N{...}. You can't say that your data _accidentally_ became
Unicode.
If you talk revolution, my favorite thought experiment is:
$_= 'some_high_bit_char_that_is_\w_in_unicode';
$_ .= "some unicode char";
chop;
# Now $_ is the same as before but tainted with unicode
# Conceptually I'd still like it to be the same as the original $_
# and not match \w
This "tainting" is *my* main main dislike about perl unicode semantics.
One way would be to do utf8 as purely an encoding flag without
semantics. And unicode would simply be another locale. People
are used to locales changing the semantics of \w.
Another way is to consider all strings as a sequence of string segments,
where each segment itself is flagged with "unicode" or "not unicode"
(again a flag separate from utf8). Then the above example never loses
track of the fact that that first character wasn't unicode.
> This "tainting" is *my* main main dislike about perl unicode semantics.
>
> One way would be to do utf8 as purely an encoding flag without
> semantics. And unicode would simply be another locale. People
> are used to locales changing the semantics of \w.
>
> Another way is to consider all strings as a sequence of string segments,
> where each segment itself is flagged with "unicode" or "not unicode"
> (again a flag separate from utf8). Then the above example never loses
> track of the fact that that first character wasn't unicode.
Having 2 flags available to us (the first for the encoding: 8bit/utf8, the
second for the semantics: "bytes" vs "Unicode") would give us the flexibility
to "solve" all these problems in any way we liked. (Except for the backwards
compatibility issues.)
I can't see where a second flag bit is going to come from.[*]
(There's still some mileage left in the current set - eg the UV flag can be
overloaded to mean something else when IVp isn't set, the OOK flag can mean
something else when PVp isn't set, and more similar barf inducing madness)
Nicholas Clark
* aside from either perl6 or ponie.
$_= 'some_high_bit_char_that_is_\w_in_unicode';
$_ .= "some unicode char";
chop;
# Now $_ is the same as before but tainted with unicode
# Conceptually I'd still like it to be the same as the original $_
# and not match \w
With the current perl I think I'd like a pragma that croaks on this kind
of mixing. I may even want it to be dynamically scoped in case some
foreign module does it to a string I passed it.
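(A sketch of the wished-for behaviour; the pragma name is made up and no
such pragma exists today.)

    use strict_encoding;    # hypothetical
    $_  = "\xE9";           # unlabelled byte string with a high-bit char
    $_ .= "\x{100}";        # would croak: implicit upgrade of byte data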
There's still scope for nicking some bits from SvTYPE; 8 bits are
reserved for it, but only values 0-15 are currently used. This gives 3
spare flag bits and space left over for another 16 types. Reclaiming these
is as simple as modifying SVTYPEMASK AFAICT.
--
print+qq&$}$"$/$s$,$*${d}$g$s$@$.$q$,$:$.$q$^$,$@$*$~$;$.$q$m&if+map{m,^\d{0\,},,${$::{$'}}=chr($"+=$&||1)}q&10m22,42}6:17*2~2.3@3;^2dg3q/s"&=~m*\d\*.*g
Yes. But there's also the problem of strings getting "upgraded"
_in_place_ by some operations (I don't know what they are, I've just
been told about them). AFAICT, you'd agree that such upgrading
shouldn't happen, and making it go away is a Simple Matter Of
Programming....?
> And I go: "Huh? What the <bleep> did you expect? You didn't specify
> what you meant by the \xHH, so Perl had to assume _something_ when
> it joined that byte with the past-0xFF \x{HHH}."
Your point is well-taken. My idea that it can assume ASCII
(vs. EBCDIC perhaps) is, I suppose, just a smaller version of this.
> I can follow the argument that Perhaps instead we should croak /
> warn and e.g. ask people to put an explicit "use encoding" at the
> top. That could be enabled by -T and -t.
That's a compromise I could live with.
> If your data touches Unicode, IT IS NO MORE THE SAME DATA. (You can
> think of Unicode as another kind of taint, if that helps :-)
It does. :-)
I do remember a lot of fixing of this pattern
data X (byte) and data Y (UTF-8) meet;
data X gets upgraded to UTF-8 even though
no characters of Y "enter" X
during the 5.7-5.8 work. So yes, that is usually a SMOP,
of course at the cost of both space and time.
>>And I go: "Huh? What the <bleep> did you expect? You didn't specify
>>what you meant by the \xHH, so Perl had to assume _something_ when
>>it joined that byte with the past-0xFF \x{HHH}."
>
>
> Your point is well-taken. My idea that it can assume ASCII
I have no idea what you mean by that. What *do* you propose should
be in $c after this? What sequence of bytes and what sequence of
characters?
$a = chr(0xFF);
$b = chr(0x100);
$c = $a . $b;
Or are you really strongly arguing that the . should croak?
"Me no understandee what you meanee by 0xFF!"
I suggest anybody who is confused by all this study the differences
between ACR (abstract character repertoire), CCS (coded character set)
and encodings of CCSes. In Perl things are even more muddled because
strings also double as _binary_ data (either arrays of numbers, or
bitvectors)... also, sometimes people think that somehow magically
"\x41" is different from "A" (and how about "\x{41}"? That is certainly
very different, right?)
Without understanding the differences between these, there can be no
understanding of the problems here. There can't be DWIM unless one
keeps clearly in mind at what level one is operating.
Well, I've solved syswrite and send.
Any others should (in my opinion) be regarded as bugs, because they offend
the consistency of behaviour that Jarkko achieved in 5.8.0
> > I can follow the argument that Perhaps instead we should croak /
> > warn and e.g. ask people to put an explicit "use encoding" at the
> > top. That could be enabled by -T and -t.
>
> That's a compromise I could live with.
Can we achieve that internally with either the data or the lexically
enclosing scope (not sure which) holding a reference to the encoding to
treat the data as, using it for upgrade(/downgrade) purposes, and croaking
if it's needed but undefined/NULL?
Sounds like a job for lexical pragmas?
Nicholas Clark
PL_encoding. Not very lexical, I admit.
> Nicholas Clark
A lexical encoding.pm will go a long way toward a happy autrijus.
Also, it strikes me that, since this is a warning by default:
% perl -e 'print chr 256'
Wide character in print at -e line 1.
This should also be made a warning:
% perl -e '$_ = chr 1 . chr 256'
Mixed wide and non-wide characters in concatenation at -e line 1.
That will also make it very easy to diagnose pretty much all upgrading
bugs I've been having.
Thanks,
/Autrijus/
> On Sat, Mar 13, 2004 at 09:20:12PM +0200, Jarkko Hietaniemi wrote:
>
>>>Sounds like a job for lexical pragmas?
>>
>>PL_encoding. Not very lexical, I admit.
>
>
> A lexical encoding.pm will go a long way toward a happy autrijus.
I think MJD's lexical pragma patches will be much used.
> Also, it strikes me that, since this is a warning by default:
>
> % perl -e 'print chr 256'
> Wide character in print at -e line 1.
>
> This should also be made a warning:
>
> % perl -e '$_ = chr 1 . chr 256'
> Mixed wide and non-wide characters in concatenation at -e line 1.
>
> That will also make it very easy to diagnose pretty much all upgrading
> bugs I've been having.
If people by now hadn't got the idea that we have a nicely conflicting
and varied set of expectations, you just proved it :-)
I think at some point Larry decreed that _inside_ Perl people can do
whatever they pretty much want, mixing different data, creating illegal
UTF-8, whatever, but it's at the I/O boundary people need to get whined at.
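(A quick illustration of that split as things stand: mixing inside the
program is silent, while the I/O boundary complains.)

    % perl -we 'my $x = chr(0x100) . chr(0xFF)'    # no warning
    % perl -we 'print chr(0x100)'
    Wide character in print at -e line 1.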
And now I really need to figure out why my filters aren't letting me out
of this discussion.
>
> Thanks,
> /Autrijus/
Right. When he arrives in Taipei I'll try to lock him in my house until
he shows me how to work with said patches. Of course, I'll stay there
with him too...
> > This should also be made a warning:
> >
> > % perl -e '$_ = chr 1 . chr 256'
> > Mixed wide and non-wide characters in concatenation at -e line 1.
> >
> > That will also make it very easy to diagnose pretty much all upgrading
> > bugs I've been having.
>
> If people by now hadn't got the idea that we have a nicely conflicting
> and varied set of expectations, you just proved it :-)
'fraid so.
> I think at some point Larry decreed that _inside_ Perl people can do
> whatever they pretty much want, mixing different data, creating illegal
> UTF-8, whatever, but it's at the I/O boundary people need to get whined at.
I do find that decree questionable: it leads to hard-to-diagnose errors,
because at the I/O boundary one has zero information about where the
silent upgrading went wrong.
I'd be quite happy to have it as an optional warning (i.e. only seen with
-w/use warnings). "perl -t" is an acceptable compromise.
> And now I really need to figure out why my filters aren't letting me out
> of this discussion.
:-)
Thanks,
/Autrijus/
Perhaps the current set of patches should go into the development
versions of Perl, so that people can start trying them out. The
feature is not completely complete, but it is complete enough to use.
The two big technical issues remaining are:
1. Garbage collection of pragma data blocks
2. Propagation of pragmas into string 'eval'
#1 may be thorny, but I am thinking about it.
The patch I sent includes documentation and an example pragma. Have
you looked at those?
In the absence of a pragma (I suppose C<use locale> would suffice), I
think that should croak.
> "Me no understandee what you meanee by 0xFF!"
Exactly.
> I suggest anybody who is confused by all this study the differences
> between ACR (abstract character repertoire), CCS (coded character set)
> and encodings of CCSes.
I guess I've got some homework to do.
locale, yechhh. Look at what the encoding pragma does, rather.
> think that should croak.
>
>>I suggest anybody who is confused by all this study the differences
>>between ACR (abstract character repertoire), CCS (coded character set)
>>and encodings of CCSes.
>
> I guess I've got some homework to do.
Long story short:
ACR - an unordered set of abstract characters: "an A, a B, a 3".
CCS - a mapping of an ACR to numbers, usually to non-negative integers.
encoding - a serialization of the CCS numbers to bytes.
With legacy byte systems these levels could be and were easily confused
because they looked pretty much the same. With Unicode (and even
already with Asian systems in pre-Unicode days) things Get Confusing
since the question of what constitutes a "character" gets tricky.
Check out perluniintro.
More information from e.g. the Unicode consortium web pages.
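(A small illustration of the three levels in Perl terms; utf8::encode
and charnames are both in the core as of 5.8.)

    use charnames ':full';
    my $char = "\N{LATIN SMALL LETTER E WITH GRAVE}";  # the abstract character (ACR)
    printf "code point U+%04X\n", ord $char;           # U+00E8: its number in the CCS
    my $bytes = $char;
    utf8::encode($bytes);                              # serialize to the UTF-8 bytes
    printf "encoded as %vX\n", $bytes;                 # C3.A8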
P.S. I figured out why I am still seeing this darned discussion :-)
I have email filters on both the server side and client side, and my
client side filters are left twiddling their digits because messages
are no longer where they expect to find them.
I had. Although I'm not particularly well-versed in C (but quite
willing to learn), I'm not sure whether it works as simply as setting
${^ENCODING} at compile time, or whether ${^ENCODING} needs to be changed
into something inside %^H, or whether I'm totally off track here.
So I'll still need your advice. :-)
Thanks,
/Autrijus/
But if that is the approach e.g. Tk is taking, there's still a problem.
> 2. It's less obvious whether the assumption that all byte strings are
> 8859-1 for purposes of UTF-8 conversion can be fixed. Still, I
> think it's worth trying. Maybe we can reduce the collateral damage
> enough to make it OK.
I'm going to jump in here before even reading the 20+ followups, to say
a few things:
Perl's UTF8 flag has two meanings; one is that the data is unicode
characters, the other that is that the data is code points that happen
to be utf8 encoded. According to one meaning, having locale influence
anything is totally wrong; according to the other, it is totally
right.
Perhaps an upgrade with C locale should translate 0x80-0xff to different
(made up by us) code points?
Do the unicode consortium people have anything helpful to say about
this issue?
Urk. Sounds a little unfriendly. What happens with C<use locale> and
a C locale? (Keeping in mind that POSIX only requires there to be a C
locale; anything beyond that is optional.)
I cannot figure out why people are even trying to mix locales and
Unicode. To me they are pretty orthogonal, and in the cases where they
cross paths (like the \w et alia), I think Unicode should win. They
might cross paths also when one thinks about the messages programs
should give (e.g. fr_FR.UTF-8), but that is irrelevant to Perl as long
as all our messages are in English.
If someone is reading up on Unicode and i18n in general, you can ignore
all the references to the wcs* and mb* functions of various C libraries,
because first of all the C standards do not say what "wide characters"
mean, and secondly Perl uses none of that framework but instead uses its
own UTF-8-based storage and APIs.
> Perhaps an upgrade with C locale should translate 0x80-0xff to different
> (made up by us) code points?
>
> Do the unicode consortium people have anything helpful to say about
> this issue?
Off-hand I can't remember seeing any guidelines like that.
All strings are converted to/from utf8/Unicode on IO. All strings in
the core are utf8/Unicode, or act as if they are.
All file handles, including the one the script is being read from, have
an associated encoding. By default, this is utf8 for scripts, and what
the locale says for STDIN/STDOUT/STDERR. Outputting a character above
0x7F to an implicitly utf8 STD* causes a default-on warning the first
time it is done per execution of the script.
Outputting a character unrepresentable in the encoding of the filehandle
causes a default-on warning (which should give the Unicode code point,
the encoding, and the character name, if charnames is already loaded).
Now for the hard bits.
\d, \w (and by implication \b), and similar get a "use re
'<localename>'" pragmata (which scopes lexically). At least the
following should work: ASCII, which gives the "traditional" meaning: \w
is [A-Za-z0-9_], \d is [0-9]. Unicode gives \w-ness and \d-ness
according to the Unicode properties. locale gives them according to the
locale of the running program. (Note that I say at least; there may be
a method of specifying arbitrary bitmaps as well. For that matter, some
locales may have ideas about what is an isn't a letter that disagree
with both of these, though I personally doubt it.) "C" should probably
be an alias to ASCII, though I sometimes find it confusing in this
discussion that it means the locale called C, and not what libc thinks
your locale is.
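(A hypothetical sketch of how the proposed pragma might read; none of
these 'use re' arguments exist today.)

    {
        use re 'ASCII';                       # \w is [A-Za-z0-9_], \d is [0-9]
        print "no\n"  unless "\xFC" =~ /\w/;  # u-umlaut is not a word char here
    }
    {
        use re 'Unicode';                     # \w follows the Unicode properties
        print "yes\n" if "\x{FC}" =~ /\w/;    # u-umlaut is a word char here
    }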
Come to think of it, it may be better to give it a different name, such
as "use strings". The reason is that lc and uc are not as
locale-independent as often thought. (Capital I lowercases to i in most
locales, but in Turkish and one other language that I cannot recall at
the moment, to a lowercase dotless i.)
Note that the above pragmata is completely implementable (perhaps except
for the lexical behavior) using overload's constant overloading, I think.
The hard part now is deciding on the default behavior, which I think
should be use re 'Unicode'. This may break things that expect \w and \d
to have very specific meanings, but I would argue that such things are
broken by design, by not specifying what they actually mean. (Note that
word chars outside of A-Za-z are nothing new, they were just broken
until recently. "über" contains four word characters, no matter what
anybody else says.) In any case, if we provide a simple one-line fix,
it's all good, to my way of thinking.
We may want to find a better name for this, as uc and lc should be
affected by it as well (case mapping is mostly locale-independent, but
not quite -- in Turkish, I lowercases not to i, but to lowercase dotless i.
There is a similar problem with i and uppercase dotted I. There is at
least one other example, which I cannot recall. In any case, it is
quite different under use re 'ASCII'.)
As to XS: The currently-existing problem with Tk goes away: All the
strings are utf8, which is what it wants anyway. Instead, we've got
more problems. Everything which wanted non-utf8 strings before (and
probably thought it didn't really care what encoding it got them in) now
has to call a reencoding function, which most absolutely should not
change the encoding of the source string (though it may cache a
reencoded version). There should be at least two versions of this
function: One which returns NULL (or similar) on unrepresentable
characters (for XS things that want to handle such conditions itself),
and one that warns and replaces the unrepresentable chars with a
placeholder character. (There should probably not be an option for
"replace with placeholder without warning", though the warning is, of
course, optional.)
So, problems that I see:
* Backwards compatibility.
+ Esp. with XS, which now has its world-view radically changed.
+ With existing scripts.
- making use re 'ASCII' be the default would probably minimize this,
but at the expense of being sick and wrong.
- making the default be "match the encoding of the most recently
mentioned file handle" would probably fix it for most cases,
at the expense of being hugely ugly.
* Increased footprint
+ Essentially makes almost everything need Encoding
+ Including a nice interface to it from XS
+ Is this really avoidable?
+ Lots of roundtrips to utf8
+ This may be avoidable, which is why I said "or acts like it".
More on that last point:
If we always access the string value of a SVPV through a macro/function,
we can try to be smart about when we translate. But this should be
carefully kept apart from the current duel (pun intended) model.
Utf8/Unicode is the one true way, and everything else is subservient to
it. If we can only keep one around, that's the one. This assumes
everything is round-tripable to utf8 (which is a major design goal of
Unicode anyway), and if it's not, that we don't care about the data loss
that will ensue (but note that we can cheat and use the vendor-reserved
area).
This is already longer than I wanted it to be, so I'll leave off the
strong conclusion. I never was any good at writing conclusions.
Hoping that all the re-hashing will bring somebody a good high,
-=- James Mastros
What's left to argue about:
1) Why this is a bad idea after all.
2) What the names/calling conventions should be.
3) What the defaults should be.
1) I think this is a good idea, obviously, or I wouldn't be floating it.
I haven't been around here that long, though, and would love to hear
reasons to the contrary.
2) I think the regex behavior should be set by the existing use re.
use re 'unicode' -- \w, \d, etc are based on the appropriate unicode
properties. (I.e. the existing behavior, if the string being matched
has its utf8 bit set.) (Alias: Latin-1.)
use re 'locale' -- \w, \d, etc take their definition from what libc
thinks the current locale is. (Alias: ENV.)
use re 'ASCII' -- \w, \d, etc take their traditional meaning. (Alias:
traditional.)
use re 'magic' -- ASCII if the string being tested does not have its
utf8 bit set, unicode if it does. This is the current behavior.
(Alias: Broken.)
(Names should be case-insensitive.)
I'm afraid I have no good idea of where the recoding behavior should be set.
3) I think the default should be use re 'magic', that is, current
behavior. Broken as it is, too many people expect it.
-=- James Mastros
> Another way is to consider all strings as a sequence of string segments,
>
>where each segment itself is flagged with "unicode" or "not unicode"
>(again a flag separate from utf8). Then the above example never loses
>track of the fact that that first character wasn't unicode.
>
This breaks Perl's historic low-level concept of a string as a block of
memory, although it does give us more powerful string manipulation
semantics and possibly a SUBSTR tie method.
--
david...@pay2send.com.
Include phrase "cat and buttered toast" to get through my filter
Right - we can't use what is in the string or how the string is represented
as a marker. If Unicode chars are in use anywhere in the program there
is a risk they will leak into legacy code.
(This is a big snag with perl5.6's utf8 stuff.)
So we need a scheme that has meaning (even if that is 'croak', but I
think we can do better) in such cases.
I can agree to that.
But that brings me back to where I came in - we currently have in our
API:
=for apidoc Am|char*|SvPVutf8|SV* sv|STRLEN len
Like C<SvPV>, but converts sv to utf8 first if necessary.
For XS modules that are trying to become Unicode aware 'tis what we use.
It upgrades in place.
If we are to stop Tk et. al. messing with SVs like that then we need
a way to fake this API.
What about something based on Sadahiro's suggestion? Eventually it could
be introduced in PPPort.
I guess Tk can (but probably doesn't) make copies of most SVs.
For those it can't copy e.g.
$entry->configure(-textvariable => \$price);
Here, if the user types €2, then $price HAS to be upgraded.
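(For reference, the pure-Perl analogue of that in-place upgrade is
utf8::upgrade; a small sketch:)

    my $price = "\xFF";      # byte string
    utf8::upgrade($price);   # internal representation is now UTF-8
    print utf8::is_utf8($price) ? "UTF8 flag on\n" : "still bytes\n";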
>
>> 2. It's less obvious whether the assumption that all byte strings are
>> 8859-1 for purposes of UTF-8 conversion can be fixed. Still, I
>> think it's worth trying. Maybe we can reduce the collateral damage
>> enough to make it OK.
>
>I'm going to jump in here before even reading the 20+ followups, to say
>a few things:
>
>Perl's UTF8 flag has two meanings; one is that the data is unicode
>characters, the other is that the data is code points that happen
>to be utf8 encoded. According to one meaning, having locale influence
>anything is totally wrong; according to the other, it is totally
>right.
>
>Perhaps an upgrade with C locale should translate 0x80-0xff to different
>(made up by us) code points?
Hmm - 1st new idea for a while.
What we currently have though is perl defaulting to a "Perl" locale,
which is a superset of C/POSIX.
>
>Do the unicode consortium people have anything helpful to say about
>this issue?
There are some "private use" areas of the codepoint space.
Sadahiro's code looks like a solid way to be sure to get UTF-8.
Consolidating the latest version of that with the sv_copypv stuff
it becomes:
void
unicode_upgrade(sv)
SV * sv
PROTOTYPE: $
PREINIT:
char *s;
STRLEN len;
PPCODE:
s = SvPV(sv,len); /* mg_get(sv) happens here */
if (!SvUTF8(sv)) {
#ifdef sv_copypv /* perl 5.7.3 or later */
SV* tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
#else
SV* tmp_sv = sv_2mortal(newSVpvn(s,len));
#endif
if (SvPOKp(tmp_sv)) /* taintedness ignored */
SvPOK_on(tmp_sv);
sv_utf8_upgrade(tmp_sv);
sv = sv_mortalcopy(tmp_sv); /* taintedness recovered */
}
XPUSHs(sv);
That results in either the original SV or a mortal copy.
So there are potential lifetime issues with it, but otherwise
it looks fine.
However, as it stands it is an XSUB. I don't want to have to
call an XSUB, so it needs wrapping up inside SvPVutf8().
If the SV is already correct then the macro version is fine, so the
inner sv_2pvutf8 would bundle the logic above:
We have now:
char *
Perl_sv_2pvutf8(pTHX_ register SV *sv, STRLEN *lp)
{
sv_utf8_upgrade(sv);
return SvPV(sv,*lp);
}
So does that become:
char *
Perl_sv_2pvutf8(pTHX_ register SV *sv, STRLEN *lp)
{
char *s = SvPV(sv,*lp); /* mg_get(sv) happens here */
if ((SvPOK(sv) || SvPOKp(sv)) /* only bother if string? */
&& !SvUTF8(sv)) {
SV* tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
if (SvPOKp(tmp_sv)) /* taintedness ignored - ??? */
SvPOK_on(tmp_sv);
sv_utf8_upgrade(tmp_sv);
s = SvPV(tmp_sv,*lp); /* will hit macros now? */
}
return s;
}
That does double magic (and double overload-call, too). sv_copypv
should be used on overloaded sv's *instead of* the:
s = SvPV(sv,len); /* mg_get(sv) happens here */
if (!SvUTF8(sv)) {
for 5.8.0 only.
> char *s = SvPV(sv,*lp); /* mg_get(sv) happens here */
> if ((SvPOK(sv) || SvPOKp(sv)) /* only bother if string? */
Not everything that will stringify will set POK/POKp.
Which is why I posted questions and not a patch.
Can you please suggest code for Perl_2pvutf8 which incorporates
all the snags you have discovered?
The goal is to get that fixed to return a valid UTF-8 string.
Then any XS that is using the published API will get what we
promised.
I don't understand the stuff for tainting, but I'm guessing something
like this:
#if (PERL_VERSION == 7 && PERL_SUBVERSION == 3) || (PERL_VERSION == 8 && PERL_SUBVERSION == 0)
if (SvAMAGIC(sv)) {
SV * tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
if (SvPOKp(tmp_sv))
SvPOK_on(tmp_sv);
sv_utf8_upgrade(tmp_sv);
sv = sv_mortalcopy(tmp_sv);
}
else
#endif
{
s = SvPV(sv,len); /* mg_get(sv) happens here */
if (!SvUTF8(sv)) {
SV* tmp_sv = sv_2mortal(newSVpvn(s,len));
Whoops, that didn't check the UTF8 flag in the AMAGIC part.
This is better and avoids duplicating the hairy taintedness
stuff. Untested (even uncompiled):
#if (PERL_VERSION == 7 && PERL_SUBVERSION == 3) || (PERL_VERSION == 8 && PERL_SUBVERSION == 0)
/* for these perl versions, the only way to preserve the UTF8 flag
from "" overload is via sv_copypv */
U32 overloaded = SvAMAGIC(sv);
if (overloaded) {
SV *tmp_sv = sv_newmortal();
sv_copypv(tmp_sv, sv);
sv = tmp_sv;
}
/* for perl 5.8.1 and on, SvPV will return the string overload in s/len
and set the UTF8 flag on the RV, so nothing special is required */
#endif
s = SvPV(sv, len);
if (!SvUTF8(sv)) {
SV *tmp_sv;
#if (PERL_VERSION == 7 && PERL_SUBVERSION == 3) || (PERL_VERSION == 8 && PERL_SUBVERSION == 0)
/* don't create a copy if we already have one */
if (!overloaded)
#endif
For Perl_2pvutf8 we don't XPUSHs but just return the char *s.
Which is where my 2nd SvPV came from. You need to get the PV after
the upgrade.
Also, for Perl_2pvutf8, which is part of perl, we don't need the #if
stuff as we know which perl we are.
So it needs a better name. 'upgrade' seems wrong here.
Perhaps sv_as_utf8(sv) or SvSVutf8(sv)
Tim.
This is just for input to XS, right? You'll need something quite different
for output.
What do you actually want the function to return? A utf8 pv and length
or a full sv?
If the latter, I think you want to return a copy whether or not you
are upgrading to avoid subtle differences where the XS affects the
original SV if it was utf8 but not if it wasn't.
If the former, I think you want not a new function, but a SV_ASUTF8
flag on sv_2pv_flags (with a macro interface that is recommended for xs
use, and a ppport.h implementation of the macro).
What _I_ am trying to get it to do is make the existing
SvPVutf8() API actually work for these corner cases.
>
>Tim.
I want the EXISTING (apidoc POD in sv.h )
SvPVutf8(sv,len)
SvPVutf8_nolen()
and perhaps
SvPVutf8_force()
macros to work.
>You'll need something quite different
>for output.
Yes.
>
>What do you actually want the function to return? A utf8 pv and length
>or a full sv?
The macros have effective prototype (in C++-ese):
char *SvPVutf8(const SV *sv, STRLEN &len)
>
>If the latter,
The former - it is a char *.
If it were an SV then the recipient could fix the problem themselves.
But given a char * it has to be certain to be UTF-8, not UTF-8 when perl
feels like it.
>
>If the former, I think you want not a new function,
I keep saying that - I do not want a new function, I want
a fix for the one we have - Perl_sv_2pvutf8() in sv.c
which is broken.
Once we have a fixed version in some perl somewhere we can start propagating
it to ppport.h
> I want the EXISTING (apidoc POD in sv.h )
> SvPVutf8(sv,len)
> SvPVutf8_nolen()
> and perhaps
> SvPVutf8_force()
>
> macros to work.
At least I know the following one is broken and incompatible
with current Perl_sv_2pvutf8 when PL_encoding is true.
So this is not given as a patch.
/* an example of usage in XSUB:
CODE:
s = SvPVutf8(sv,len); /* s should be encoded in utf8 */
RETVAL = newSVpvn(s,len);
SvUTF8_on(RETVAL);
OUTPUT:
RETVAL
*/
char *
Perl_sv_2pvutf8(pTHX_ register SV *sv, STRLEN *lp)
{
char *s;
STRLEN len;
s = SvPV(sv,len);
if (!SvUTF8(sv)) { /* bytes.pm should be ignored. */
U8 *e, *t;
bool need_convert = 0;
/* PL_encoding is ignored, since sv_recode_to_utf8() does not
give utf8 string when *both* encoding.pm and bytes.pm are used.
It may be dangerous if "always utf8" is expected.
A byte string is assumed to be encoded in Latin-1/EBCDIC. */
e = (U8*)s + len;
for (t = (U8*)s; t < e; t++) {
if (!NATIVE_IS_INVARIANT(*t)) {
need_convert = 1;
break;
}
}
t = NULL;
if (need_convert) {
SV* tmpsv;
t = bytes_to_utf8((U8*)s, &len);
tmpsv = sv_2mortal(newSVpvn(t,len));
SvUTF8_on(tmpsv);
s = SvPVX(tmpsv);
}
if (t)
Safefree(t);
}
*lp = len;
return s;
}
Regards,
SADAHIRO Tomoyuki
If sv_pvn_force is performed before upgrade/downgrade,
sv_utf8_(up|down)grade should get an SV made POK through sv_pvn_force.
So I think that, with the following modifications, they should cope
with a non-POK scalar well.
In the whole perl distribution, these functions seem to be used
only in pp_sysread via SvPVutf8_force macro in one place...
(Other appearances are in macros defined in header files.)
How can regression tests be executed?
--- sv.c~ Sun Mar 21 22:47:12 2004
+++ sv.c Tue Mar 23 23:30:02 2004
@@ -7834,8 +7834,10 @@
char *
Perl_sv_pvbyten_force(pTHX_ SV *sv, STRLEN *lp)
{
+ sv_pvn_force(sv,lp);
sv_utf8_downgrade(sv,0);
- return sv_pvn_force(sv,lp);
+ *lp = SvCUR(sv);
+ return SvPVX(sv);
}
/* sv_pvutf8 () is now a macro using Perl_sv_2pv_flags();
@@ -7883,8 +7885,10 @@
char *
Perl_sv_pvutf8n_force(pTHX_ SV *sv, STRLEN *lp)
{
+ sv_pvn_force(sv,lp);
sv_utf8_upgrade(sv);
- return sv_pvn_force(sv,lp);
+ *lp = SvCUR(sv);
+ return SvPVX(sv);
}
/*
End of patch.
Regards,
SADAHIRO Tomoyuki
Thanks, applied as #22652 to bleadperl.