> These two programs show the crux of the issue:
>
> perl -CS -E 'use utf8; my $prog = "say qq!\x{f9}!"; eval $prog;
> utf8::upgrade($prog); eval $prog;'
> perl -le 'use utf8; print eval "q!\360\237\220\252!" eq eval "q!\x{1f42a}!"
> '
>
> On the former, to paraphrase Zefram, "use utf8;" shouldn't affect the
> correctness of the evaled program. And the latter shows why eval shouldn't
> pay attention to the hints of its enclosing scope, but only of the scalar
> passed in.
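For anyone following along, here is a minimal sketch of the distinction the two one-liners rely on. It deliberately avoids eval, since eval's behaviour is the thing under discussion; it only shows the string-level facts. The variable names are mine, not from the original programs.

```perl
use strict;
use warnings;

# Four octets: the UTF-8 encoding of U+1F42A (same escapes as the one-liner).
my $octets = "\360\237\220\252";
# One character: the code point itself.
my $char   = "\x{1f42a}";

# As Perl strings these differ: one has length 4, the other length 1.
my $same_before = ($octets eq $char) ? 1 : 0;

# Decoding a copy of the octets as UTF-8 yields the single character.
my $decoded = $octets;
utf8::decode($decoded);
my $same_after = ($decoded eq $char) ? 1 : 0;

# utf8::upgrade changes only the internal representation, never the value.
my $prog     = "say qq!\x{f9}!";
my $upgraded = $prog;
utf8::upgrade($upgraded);
my $same_value = ($prog eq $upgraded) ? 1 : 0;

printf "octets eq char: %d (lengths %d vs %d)\n",
    $same_before, length($octets), length($char);
printf "decoded eq char: %d\n", $same_after;
printf "upgraded eq original: %d\n", $same_value;
```

The last comparison is the point of the first one-liner: the two copies of the program text are equal as strings, so one would expect eval to treat them alike.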
> Fixing this, however, steps on a landmine: it's not particularly
> backwards-compatible. When working correctly, the first eval in the first
Bugs are allowed to be fixed. Is this a bug?
> program stops being a syntax error, and the eq in the second program returns
> false. And things that expected to pass octets to eval and have them
> interpreted as UTF-8 will now suddenly break. There are at least two such
> occurrences in the test suite right now, both of which are kind of buggy in
> their own right: op/utfhash.t, which reads from DATA without setting an
> encoding on the filehandle and passes the return value to eval, expecting
> them to be interpreted as UTF-8 (Which is a mortal sin, don'tyouknow; To
> quote Tom Christiansen, "If you have a DATA handle, you must explicitly set
> its encoding."), and lib/utf8.t, which admittedly I haven't given more than
Why? Surely the encoding of the DATA handle should be the same as the
encoding of the source code? And if C<use utf8;> is meant to signal that
this source code is in UTF-8, shouldn't the DATA handle then be UTF-8?
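To make concrete what "explicitly set its encoding" changes, here is a small sketch. It uses an in-memory filehandle as a stand-in for DATA (my assumption, purely so the example is self-contained); the octets are what a UTF-8 encoded U+00F9 would look like after __DATA__ on disk.

```perl
use strict;
use warnings;

# Octets for U+00F9 encoded as UTF-8, as they would sit on disk.
my $on_disk = "\xc3\xb9\n";

# No layer: the handle returns the raw octets (two of them, after chomp).
open(my $raw, '<', \$on_disk) or die "open: $!";
my $raw_line = <$raw>;
chomp $raw_line;
close $raw;

# With an explicit layer: the handle returns one decoded character.
open(my $dec, '<:encoding(UTF-8)', \$on_disk) or die "open: $!";
my $dec_line = <$dec>;
chomp $dec_line;
close $dec;

printf "raw length: %d, decoded length: %d\n",
    length($raw_line), length($dec_line);
```

For the real handle, the usual idiom is binmode(DATA, ':encoding(UTF-8)') before the first read. Whether DATA should pick that up automatically from the source encoding is exactly the question above.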
[Sort of related. IIRC the automatic BOM-and-UTF-16 spotter in
S_swallow_bom() is inconsistent. In that IIRC it has the equivalent effect of
a 'use utf8' for UTF-16, but not for eating a UTF-8 BOM. And I think
probably it should be consistent, with the current UTF-16 approach being
correct. Probably also it should be rewritten to push a PerlIO layer,
instead of having a private custom source filter, as that would fix one of
the "other" "bugs" [well, inconsistency], that the t/TEST utf8 and utf16
options (last I checked) only failed because <DATA> was no longer in the
encoding they expected. But this is self-contained. Better to finish what
you've started than also try to bite this off.]
> a cursory glance, but I get the feeling it's testing things in a completely
> backwards way, by using "use utf8/no utf8" inside string evals to test how
> UTF-8 hash keys work (that also makes me wonder whether the tests are
> misplaced, as there's a op/utfhash.t).
Possibly. But I think the tests you're referring to were added to lib/utf8.t
in 2001 by 4c26891c6a00d6f5, whereas t/op/utfhash.t wasn't created until 2002.
So it's more that neither Nick Ing-Simmons at the time, nor anyone
subsequently, felt like cleaning up the locations of the various tests.
Nicholas Clark