GSoC Status Report, Week 9


Brian Fraser

Jul 27, 2011, 4:41:28 AM
to tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
Howdy people.

Last week I took a couple of extra stabs at rebasing, and the repo is now in considerably better shape - reviews are welcome, as usual! I also dove into mg.c (as it shares the same magic-granting errors that I tackled in gv.c the previous week) and toke.c (though nothing much is ready for pushing at the moment). Cleaning up attributes turned out to be trivial, so that's done. Huzzah!

Finally, in the last few days, I've been taking a look at string eval, with much-appreciated handholding from Zefram : )
These two programs show the crux of the issue:

perl -CS -E 'use utf8; my $prog = "say qq!\x{f9}!"; eval $prog; utf8::upgrade($prog); eval $prog;'
perl -le 'use utf8; print eval "q!\360\237\220\252!" eq eval "q!\x{1f42a}!" '

On the former, to paraphrase Zefram, "use utf8;" shouldn't affect the correctness of the evaled program. The latter shows why eval shouldn't pay attention to the hints of its enclosing scope, but only to those of the scalar passed in.
Fixing this, however, steps on a landmine: it's not particularly backwards-compatible. When working correctly, the first eval in the first program stops being a syntax error, and the eq in the second program returns false. And things that expected to pass octets to eval and have them interpreted as UTF-8 will suddenly break. There are at least two such occurrences in the test suite right now, both of which are kind of buggy in their own right.

The first is op/utfhash.t, which reads from DATA without setting an encoding on the filehandle and passes the return values to eval, expecting them to be interpreted as UTF-8. (Which is a mortal sin, don't you know; to quote Tom Christiansen, "If you have a DATA handle, you must explicitly set its encoding.")

The second is lib/utf8.t, which admittedly I haven't given more than a cursory glance, but I get the feeling it's testing things in a completely backwards way, by using "use utf8"/"no utf8" inside string evals to test how UTF-8 hash keys work. (That also makes me wonder whether the tests are misplaced, seeing as an op/utfhash.t exists.)

As for this week, I'll continue working on string eval and reviewing the GV/stash stuff, plus tackling whatever toke.c throws at me (I already did a bit of work on the tokenizer in other branches while working on GVs, so chances are I'll also go back and see if there's anything usable there).
And finally, the sidequest of the week involves ironing out a few wrinkles that pop up when using UTF-8 labels with characters in the Latin-1 range.

Nicholas Clark

Aug 3, 2011, 7:28:53 AM
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On Wed, Jul 27, 2011 at 05:41:28AM -0300, Brian Fraser wrote:

> These two programs show the crux of the issue:
>
> perl -CS -E 'use utf8; my $prog = "say qq!\x{f9}!"; eval $prog;
> utf8::upgrade($prog); eval $prog;'
> perl -le 'use utf8; print eval "q!\360\237\220\252!" eq eval "q!\x{1f42a}!"
> '
>
> On the former, to paraphrase Zefram, "use utf8;" shouldn't affect the
> correctness of the evaled program. The latter shows why eval shouldn't
> pay attention to the hints of its enclosing scope, but only to those of
> the scalar passed in.
> Fixing this, however, steps on a landmine: it's not particularly
> backwards-compatible. When working correctly, the first eval in the first

Bugs are allowed to be fixed. Is this a bug?

> program stops being a syntax error, and the eq in the second program returns
> false. And things that expected to pass octets to eval and have them
> interpreted as UTF-8 will now suddenly break; There are at least two such
> occurrences in the test suite right now, both of which are kind of buggy in
> their own right: op/utfhash.t, which reads from DATA without setting an
> encoding on the filehandle and passes the return values to eval, expecting
> them to be interpreted as UTF-8 (Which is a mortal sin, don't you know; To
> quote Tom Christiansen, "If you have a DATA handle, you must explicitly set
> its encoding."), and lib/utf8.t, which admittedly I haven't given more than

Why? Surely the encoding of the DATA handle should be the same as the
encoding of the source code? And if C<use utf8;> is meant to signal that
this source code is in UTF-8, shouldn't the DATA handle then be UTF-8?

[Sort of related. IIRC the automatic BOM-and-UTF-16 spotter in
S_swallow_bom() is inconsistent. In that, IIRC, it has the equivalent effect
of a 'use utf8' for UTF-16, but not for eating a UTF-8 BOM. And I think
probably it should be consistent, with the current UTF-16 approach being
correct. Probably also it should be rewritten to push a PerlIO layer,
instead of having a private custom source filter, as that would fix one of
the "other" "bugs" [well, inconsistency], that the t/TEST utf8 and utf16
options (last I checked) only failed because <DATA> was no longer in the
encoding they expected. But this is self-contained. Better to finish what
you've started than also try to bite this off.]

> a cursory glance, but I get the feeling it's testing things in a completely
> backwards way, by using "use utf8/no utf8" inside string evals to test how
> UTF-8 hash keys work (that also makes me wonder whether the tests are
> misplaced, as there's a op/utfhash.t).

Possibly. But I think the tests you're referring to were added to lib/utf8.t
in 2001 by 4c26891c6a00d6f5, whereas t/op/utfhash.t wasn't created until 2002.

So it's more that neither Nick Ing-Simmons at the time, nor anyone
subsequently, felt like cleaning up the locations of the various tests.

Nicholas Clark

Brian Fraser

Aug 4, 2011, 2:13:22 AM
to Nicholas Clark, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On Wed, Aug 3, 2011 at 8:28 AM, Nicholas Clark <ni...@ccl4.org> wrote:

> Bugs are allowed to be fixed. Is this a bug?


I think it is. But I went outside the echo chamber and tried it in Ruby, and got similar results there, so I'm not sure. (Ruby does have a better reason for the behavior, though, as its String objects carry the encoding under which they were compiled; we have no such excuse.) That is,

my $latin_1 = "q!\360\237\220\252!";
say eval $latin_1 eq "\x{1f42a}";
{ use utf8; say eval $latin_1 eq "\x{1f42a}"; }

prints false and then true, but I believe that something of an equivalent in Ruby would return false to both - though I can't test this until Fridayish.
 
 
> Why? Surely the encoding of the DATA handle should be the same as the
> encoding of the source code? And if C<use utf8;> is meant to signal that
> this source code is in UTF-8, shouldn't the DATA handle then be UTF-8?


Without giving it much thought, I'd say "No." use utf8; should only mean that the source is in UTF-8, with no side effects; otherwise you go down the same road use encoding did - too many meanings and defaults. Setting layers should be left to open.pm, or done explicitly (though a way of setting DATA's encoding through -C/PERL_UNICODE/open would be nice, I guess). Plus, just think of the crapstorm that would ensue when someone asks which layer it should push! :utf8? :encoding(UTF-8)?

But that's just me, and I've been heinously wrong before. : )
 
> [Sort of related. IIRC the automatic BOM-and-UTF-16 spotter in
> S_swallow_bom() is inconsistent. In that, IIRC, it has the equivalent effect
> of a 'use utf8' for UTF-16, but not for eating a UTF-8 BOM. And I think
> probably it should be consistent, with the current UTF-16 approach being
> correct. Probably also it should be rewritten to push a PerlIO layer,
> instead of having a private custom source filter, as that would fix one of
> the "other" "bugs" [well, inconsistency], that the t/TEST utf8 and utf16
> options (last I checked) only failed because <DATA> was no longer in the
> encoding they expected. But this is self-contained. Better to finish what
> you've started than also try to bite this off.]


I have to admit that I've been ignoring S_swallow_bom(), but sure, I'll put it on my TODO list.
 
> Possibly. But I think the tests you're referring to were added to lib/utf8.t
> in 2001 by 4c26891c6a00d6f5, whereas t/op/utfhash.t wasn't created until 2002.
>
> So it's more that neither Nick Ing-Simmons at the time, nor anyone
> subsequently, felt like cleaning up the locations of the various tests.


I am Jack's complete failure to recall git blame. : )
I'll try to salvage the lib/utf8.t tests and move them into op/utfhash.t, though I'm not quite sure what they were originally intended to test, seeing how they didn't take eval()'s weird behavior into account.
