GSOC Status Report, Week 11

Brian Fraser

Aug 11, 2011, 7:57:45 AM

to tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson

Howdy all.

Slightly frustrating week, unfortunately. I continued the toke.c cleanup and organized the repo from utter mess to normal mess, then decided to go back and fix two outstanding GV/Stash related bugs:

First, an hv_common() call in sv_isa_lookup_pvn_flags() was failing when it reused the hash stored in the HEK of the class's GV. This only happened when the class name was fully qualified and in UTF-8, and only for names made up of characters within the Latin-1 range.

So, for example,

    package À {
        sub new { bless {}, shift }
    }
    my $obj = À->new;
    say $obj->isa("À");
    say $obj->isa("main::À");

That second ->isa would wrongly return false. [*]

In hindsight, the reason was pretty obvious. TomC's talk of normalization in the other thread made things click for me: neither hv_name_set (and related functions) nor gv_init downgraded their input when inserting new names. I think this was a conscious decision early on, because all of those functions internally call share_hek(), which does downgrade whenever possible.

What I hadn't noticed early on was that the hash passed to share_hek wasn't updated along with the pv/len, so the stored hash still corresponded to the original UTF-8 bytes. Now I'm downgrading before calculating the hash, and things are working fine, apparently. So huzzah.
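
As a rough illustration of that ordering problem, here is a standalone C sketch. It does not use the real perl API: toy_hash (plain djb2), downgrade_latin1, struct toy_key and both make_key_* functions are hypothetical stand-ins, and the fixed 64-byte buffer is an arbitrary choice for the example. The point is only that a hash taken over the UTF-8 bytes stops matching once the stored bytes have been downgraded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy string hash (djb2); a stand-in, NOT perl's real hash function. */
static uint32_t toy_hash(const unsigned char *s, size_t len) {
    uint32_t h = 5381u;
    for (size_t i = 0; i < len; i++)
        h = h * 33u + s[i];
    return h;
}

/* Try to downgrade UTF-8 to Latin-1. Returns the new length, or
 * (size_t)-1 if a character above U+00FF (or a malformed sequence)
 * is found. `out` must have room for at least `len` bytes. */
static size_t downgrade_latin1(const unsigned char *in, size_t len,
                               unsigned char *out) {
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[n++] = c;                       /* ASCII: copy as-is */
        } else if ((c == 0xC2 || c == 0xC3) && i + 1 < len
                   && (in[i + 1] & 0xC0) == 0x80) {
            /* Two-byte sequence encoding U+0080..U+00FF */
            out[n++] = (unsigned char)(((c & 0x03) << 6) | (in[i + 1] & 0x3F));
            i++;
        } else {
            return (size_t)-1;                  /* not downgradable */
        }
    }
    return n;
}

/* Hypothetical shared-key record: stored bytes plus a cached hash. */
struct toy_key {
    unsigned char bytes[64];
    size_t len;
    uint32_t hash;
};

/* Store the downgraded form when possible, else the original bytes. */
static size_t store_bytes(const unsigned char *utf8, size_t len,
                          struct toy_key *k) {
    size_t n = downgrade_latin1(utf8, len, k->bytes);
    if (n == (size_t)-1) {
        memcpy(k->bytes, utf8, len);
        n = len;
    }
    return n;
}

/* Buggy order: the hash is taken over the caller's UTF-8 bytes, then
 * the stored bytes are downgraded, so hash and bytes can disagree. */
static int make_key_hash_first(const unsigned char *utf8, size_t len,
                               struct toy_key *k) {
    if (len > sizeof k->bytes) return -1;
    k->hash = toy_hash(utf8, len);
    k->len = store_bytes(utf8, len, k);
    return 0;
}

/* Fixed order: downgrade first, then hash what is actually stored. */
static int make_key_downgrade_first(const unsigned char *utf8, size_t len,
                                    struct toy_key *k) {
    if (len > sizeof k->bytes) return -1;
    k->len = store_bytes(utf8, len, k);
    k->hash = toy_hash(k->bytes, k->len);
    return 0;
}
```

Feeding both functions the UTF-8 bytes of "main::À" (À is 0xC3 0x80) stores the same seven Latin-1 bytes either way, but only make_key_downgrade_first leaves a cached hash that equals toy_hash() recomputed over the stored bytes. A lookup that trusts the cached hash from the other variant would probe the wrong bucket.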

The second bug was something I mentioned in a reply to the previous report; trying to load swashes too late (i.e. after a croak from the tokenizer) resulted in a 'do FILE' in utf8_heavy.pl dying from a compilation error.

I still have no clue why. But a workaround (that doesn't suck as much as switching the isIDFIRST checks to something else) is simply loading the XIDS swash early on. Putting something like

    if (UTF) {
        /* Force the XIDS swash to load before the tokenizer can croak */
        bool tmpbool;
        tmpbool = is_utf8_xidfirst((U8*)"");
    }

near the top of yylex() should do for that; I haven't traveled far enough into SWASH territory to figure out a less ham-fisted solution.

In the next few days I'll push those changes to the stash/gv repo, unless someone objects. Meanwhile, toke.c still needs work.

Moving on, since we are nearing the end of GSOC, here are some of my TODOs that lack RT tickets and most likely won't be completed before the deadline:

- The second argument of open() is forced into octets. This means that :via(N) and :encoding(N), with cleaned up stashes/GVs, are now broken for UTF-8 names.
- reset()
- Source filters still have the UTF-8ness of their returned SVs ignored.
- S_swallow_bom in toke.c.
- use encoding; is still broken, with no replacement in sight.
- A bunch of warnings in the reg*.c files (the POSIX syntax/class errors, et al).
- gv_fetchfile. This is actually trivial and could probably be done in a couple of minutes, but I'm reluctant to push something without tests.

It's a tad saddening to have use encoding; on that list, but I'd rather not risk rushing it and ending up causing an even bigger mess. There's enough time after the evaluations, anyway : )

[*] Weekly how to be an ass to your maintenance programmer:

    package A { @ISA = qw(B) }
    my $obj = bless {}, "A";
    $obj->isa("B");
    $obj->isa("main::B");

The former ->isa will always work; the latter will only work if B is in the symbol table.

Nicholas Clark

Aug 13, 2011, 10:30:40 AM

to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson

On Thu, Aug 11, 2011 at 08:57:45AM -0300, Brian Fraser wrote:

> In hindsight, the reason was pretty obvious. TomC's talk of normalization in
> the other thread made things click for me; Neither hv_name_set (and related
> functions) nor gv_init downgraded input when inserting new names. I think
> this was a conscious decision early on, because all of those functions
> internally call share_hek(), which does downgrade whenever possible.
> What I hadn't noticed early on was that the hash passed to share_hek wasn't
> modified along with the pv/len; Now I'm downgrading before calculating the
> hash, and things are working fine, apparently. So huzzah.

That's a bug in Perl_share_hek() then. As it's taking it upon itself to change
which string is stored, it ought to be calculating the hash corresponding to
the change it unilaterally made.

> The second bug was something I mentioned in a reply to the previous report;
> trying to load swashes too late (i.e. after a croak from the tokenizer)
> resulted in a 'do FILE' in utf8_heavy.pl dying from a compilation error.
> I still have no clue why. But a workaround (that doesn't suck as much as
> switching the isIDFIRST checks to something else) is simply loading the XIDS
> swash early on. Putting something like
> if (UTF) {
> bool tmpbool;
> tmpbool = is_utf8_xidfirst((U8*)"");
> }
> near the top of yylex() should make do for that; I haven't traveled far
> enough into SWASH territory to figure out a less ham-fisted solution.

That's ugly. It would be nice to work out why it's a problem. I appreciate
that you don't have enough time to be that person.

> - Source filters still have the UTF-8ness of their returned SVs ignored.

Yes. Source filters are bugging me as part of the "what *does* C<use utf8;>
mean?" question.