GSOC Status Report, Week 7


Brian Fraser

Jul 11, 2011, 3:43:03 AM
to tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Zefram, Father Chrysostomos, Karl Williamson
Hey people.

First order of business, here's the repo with the progress as of now: https://github.com/Hugmeir/gsoc-pad-utf8-safety/commits/stashgv_clean
Feel free to dissect it to oblivion, but as I'm not finished rebasing, please excuse the mess for a while longer.
rafl tried it a couple of days ago and a pair of bugs related to threaded builds and assertions popped out, so I spent most of the weekend hunting those. (doy & hobbs)++, since their combined explanation on unions basically solved one of the issues for me. Also, huge thank you to rafl, mst, HellKat and the Cyberkat.eu folk for setting me up with an awesome ssh that can actually run all of the test suite in a single day (less than ten minutes actually. It's beautiful and somehow keeps getting dirt in my eyes).

And now, some rough spots that need attention:
  • I mentioned that I added new _(pvn|pv|sv) versions of several functions, but what should I do about the original versions, which are now little more than wrappers around the new functions? Leave them next to the new ones? Macros somewhere? rafl suggested the macro way, which I prefer the most, but with that I'm not sure where to sic sv_derived_from and sv_does.

  • *ò =~ s///r; gives a coercion failure, which isn't present in blead. I can't follow pp_subst well enough to figure out a good fix; a bad one (in light of my ignorance) would be to change two s = SvPV_force()s into s = (isGV_with_GP() && (rpm->op_pmflags & PMf_NONDESTRUCT)) ? SvPV() : SvPV_force(); Suggestions/explanations welcome as usual.

  • Until yesterday, GvNAMEUTF8() was doing something like (HEK_UTF8() || HEK_WASUTF8()), as the double check was needed by most places that called SvUTF8() on GVs (pp_concat, stringify, substr, SvPV, etc). However, that double check was occasionally breaking hv_(fetch|store|common|etc) calls for globs with latin-1 in them, so I removed HEK_WASUTF8 from GvNAMEUTF8() and added it explicitly to SvUTF8(). Question is: is there any value in adding a GvNAMEWASUTF8() macro, seeing how hardly anything would end up using it, rather than continuing to use HEK_WASUTF8(GvNAME_HEK())?

  • Father C, your comments for pp_caller remain unimplemented, because I don't quite understand what you are suggesting :( Would you mind clarifying it a bit? Also, I had to disregard a part of your advice on pp_ref, as using an SV makes it both UTF-8 and NUL clean, which is hopefully worth the extra memory.
This week I'll finish rebasing and keep on adding tests; once reviews start coming in, I'll focus on those, but if there's any extra time I'll try to get prototypes clean. Oh, and do the midterm evaluation, I guess.

Nicholas Clark

Jul 15, 2011, 5:43:02 PM
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Zefram, Father Chrysostomos, Karl Williamson
On Mon, Jul 11, 2011 at 04:43:03AM -0300, Brian Fraser wrote:

> And now, some rough spots that need attention:
>
> - I mentioned that I added new versions of functions with _(pvn|pv|sv)
> versions; But what should I do about the original versions, which are now
> little more than wrappers around the new functions? Leave them next to the
> new ones? Macros somewhere? rafl suggested the macro way, which I prefer the
> most, but with that I'm not sure where to sic sv_derived_from and sv_does.

Would inline functions work?

There is this "new fangled" PERL_STATIC_INLINE as of 5.14 which gives
C<static inline> where possible, and C<static> on legacy compilers.
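
As a rough sketch of the idea, here is an "old" entry point kept as a thin inline wrapper around a "new" length-taking one. All names are made up for illustration: MY_STATIC_INLINE only imitates what PERL_STATIC_INLINE expands to on a C99 compiler, and name_matches_pvn()/name_matches_pv() are hypothetical, not the real sv_derived_from or sv_does signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PERL_STATIC_INLINE (assumption: a C99-or-later compiler). */
#define MY_STATIC_INLINE static inline

/* Hypothetical "new" function: takes explicit lengths, so names with
 * embedded NULs are compared in full. */
MY_STATIC_INLINE bool
name_matches_pvn(const char *name, size_t len,
                 const char *want, size_t want_len)
{
    assert(name != NULL && want != NULL);
    return len == want_len && memcmp(name, want, len) == 0;
}

/* Hypothetical "old" function, now only a wrapper: it measures the
 * NUL-terminated strings and forwards to the _pvn version. */
MY_STATIC_INLINE bool
name_matches_pv(const char *name, const char *want)
{
    return name_matches_pvn(name, strlen(name), want, strlen(want));
}
```

A caller of the old name, e.g. name_matches_pv("Foo::Bar", "Foo::Bar"), keeps working unchanged, while new code can pass lengths explicitly.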

> - Until yesterday, GvNAMEUTF8() was doing something like (HEK_UTF8() ||
> HEK_WASUTF8()), as the double check was needed by most places that called
> SvUTF8() on GVs (pp_concat, stringify, substr, SvPV, etc). However, that
> double-check was occasionally breaking hv_(fetch|store|common|etc) calls for
> globs with latin-1 in them, so I removed the HEK_WASUTF8 out of GvNAMEUTF8
> and added it explicitly in SvUTF8(). Question is, is there any value to
> adding a GvNAMEWASUTF8() macro, seeing how there's not much of anything that
> would end up using it, rather than continue using HEK_WASUTF8(GvNAME_HEK())?

I don't think that any code doing comparisons needs to worry about *WASUTF8.
It doesn't affect the (actual) encoding of the sequence of octets in the HEK.

HEKs containing only characters in the range 0-255 are always stored as bytes,
and HVhek_UTF8 is false.

HEKs containing any characters >255 are stored as UTF-8, and HVhek_UTF8 is
true.

That's enough for comparisons.
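
A toy illustration of that rule, assuming nothing about the real hv.c internals: struct toy_key, the TOY_* flag values and toy_key_eq() are all invented for this sketch. Equality looks only at the encoding flag and the stored octets; the "was UTF-8" flag never takes part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy flag values; the real HVhek_* constants live in hv.h and are
 * not reproduced here. */
#define TOY_UTF8    0x01  /* octets are UTF-8 (some character > 255)  */
#define TOY_WASUTF8 0x02  /* key came from a UTF-8 scalar, now bytes  */

/* Invented stand-in for a hash key: octets, length, flags. */
struct toy_key {
    const char *octets;
    size_t      len;
    int         flags;
};

/* Two keys are equal when they agree on the encoding flag and on the
 * stored octets.  TOY_WASUTF8 is deliberately ignored. */
bool
toy_key_eq(const struct toy_key *a, const struct toy_key *b)
{
    assert(a != NULL && b != NULL);
    return (a->flags & TOY_UTF8) == (b->flags & TOY_UTF8)
        && a->len == b->len
        && memcmp(a->octets, b->octets, a->len) == 0;
}
```

With this, a one-octet key "\xf2" compares equal whether or not TOY_WASUTF8 is set, while the same two octets with and without TOY_UTF8 are different keys.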


HVhek_WASUTF8 was added just before 5.8.0 was released to permit C<keys> to
return scalars identically encoded to the scalar used to create the hash key,
thanks to the horrible way that SvUTF8() is used by the core both to signal
encoding and matching semantics. People didn't like it in late 5.7.x when
C<keys> always returned "upgraded" scalars, and other people didn't like it
when C<keys> returned "downgraded" scalars. The only way to please everyone
was to have C<keys> faithfully return whatever people had used to create the
hash. It's only "used" here [complete with typos in the comments], to trigger
a call to bytes_to_utf8() on the *bytes* stored in (HEK_KEY(),HEK_LEN()):

SV *
Perl_newSVhek(pTHX_ const HEK *const hek)
{
    dVAR;
    if (!hek) {
        SV *sv;

        new_SV(sv);
        return sv;
    }

    if (HEK_LEN(hek) == HEf_SVKEY) {
        return newSVsv(*(SV**)HEK_KEY(hek));
    } else {
        const int flags = HEK_FLAGS(hek);
        if (flags & HVhek_WASUTF8) {
            /* Trouble :-)
               Andreas would like keys he put in as utf8 to come back as utf8
            */
            STRLEN utf8_len = HEK_LEN(hek);
            SV * const sv = newSV_type(SVt_PV);
            char *as_utf8 = (char *)bytes_to_utf8 ((U8*)HEK_KEY(hek), &utf8_len);
            /* bytes_to_utf8() allocates a new string, which we can repurpose: */
            sv_usepvn_flags(sv, as_utf8, utf8_len, SV_HAS_TRAILING_NUL);
            SvUTF8_on (sv);
            return sv;
        } else if (flags & (HVhek_REHASH|HVhek_UNSHARED)) {
            /* We don't have a pointer to the hv, so we have to replicate the
               flag into every HEK. This hv is using custom a hasing
               algorithm. Hence we can't return a shared string scalar, as
               that would contain the (wrong) hash value, and might get passed
               into an hv routine with a regular hash.
               Similarly, a hash that isn't using shared hash keys has to have
               the flag in every key so that we know not to try to call
               share_hek_kek on it.  */

            SV * const sv = newSVpvn (HEK_KEY(hek), HEK_LEN(hek));
            if (HEK_UTF8(hek))
                SvUTF8_on (sv);
            return sv;
        }
        /* This will be overwhelminly the most common case.  */
        {
            /* Inline most of newSVpvn_share(), because share_hek_hek() is far
               more efficient than sharepvn().  */
            SV *sv;

            new_SV(sv);
            sv_upgrade(sv, SVt_PV);
            SvPV_set(sv, (char *)HEK_KEY(share_hek_hek(hek)));
            SvCUR_set(sv, HEK_LEN(hek));
            SvLEN_set(sv, 0);
            SvREADONLY_on(sv);
            SvFAKE_on(sv);
            SvPOK_on(sv);
            if (HEK_UTF8(hek))
                SvUTF8_on(sv);
            return sv;
        }
    }
}
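
For readers unfamiliar with what the WASUTF8 branch does to the stored bytes, here is a standalone sketch of the byte-to-UTF-8 step. It is not the core's bytes_to_utf8(): toy_bytes_to_utf8() is a hypothetical helper, it uses plain malloc() rather than perl's allocator, and it assumes an ASCII/Latin-1 platform (no EBCDIC handling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert *len Latin-1 bytes at s into a freshly
 * malloc()ed, NUL-terminated UTF-8 string.  On success *len is updated
 * to the UTF-8 length and the caller must free() the result.
 * Returns NULL if allocation fails. */
unsigned char *
toy_bytes_to_utf8(const unsigned char *s, size_t *len)
{
    size_t i;
    unsigned char *out, *d;

    assert(s != NULL && len != NULL);

    /* Worst case: every byte is >= 0x80 and needs two octets. */
    out = malloc(*len * 2 + 1);
    if (out == NULL)
        return NULL;

    d = out;
    for (i = 0; i < *len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                                   /* ASCII: unchanged */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));    /* lead octet       */
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));  /* continuation     */
        }
    }
    *d = '\0';
    *len = (size_t)(d - out);
    return out;
}
```

Feeding it the two bytes 'a', 0xF2 (Latin-1 "aò") yields the three octets 'a', 0xC3, 0xB2, which is the upgraded form a scalar would carry once its UTF-8 flag is turned on.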


Nicholas Clark

Brian Fraser

Jul 21, 2011, 5:07:36 AM
to Nicholas Clark, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Zefram, Father Chrysostomos, Karl Williamson
On Fri, Jul 15, 2011 at 6:43 PM, Nicholas Clark <ni...@ccl4.org> wrote:
> On Mon, Jul 11, 2011 at 04:43:03AM -0300, Brian Fraser wrote:
>
> Would inline functions work?
>
> There is this "new fangled" PERL_STATIC_INLINE as of 5.14 which gives
> C<static inline> where possible, and C<static> on legacy compilers.

I don't think so; the current functions are A in embed.fnc, so they can't be static. :(
 


> I don't think that any code doing comparisons needs to worry about *WASUTF8.
> It doesn't affect the (actual) encoding of the sequence of octets in the HEK.
>
> HEKs containing only characters in the range 0-255 are always stored as bytes.
> and HVhek_UTF8 is false.
>
> HEKs containing any characters >255 are stored as UTF-8, and HVhek_UTF8 is
> true.
>
> That's enough for comparisons.
>
> HVhek_WASUTF8 was added just before 5.8.0 was released to permit C<keys> to
> return scalars identically encoded to the scalar used to create the hash key,
> thanks to the horrible way that SvUTF8() is used by the core both to signal
> encoding and matching semantics. People didn't like it in late 5.7.x when
> C<keys> always returned "upgraded" scalars, and other people didn't like it
> when C<keys> returned "downgraded" scalars. The only way to please everyone
> was to have C<keys> faithfully return whatever people had used to create the
> hash. It's only "used" here [complete with typos in the comments], to trigger
> a call to bytes_to_utf8() on the *bytes* stored in (HEK_KEY(),HEK_LEN()):


Alright, I think I'm following, thank you. The need for WASUTF8 for globs arose from a similar issue - if I create an SV whose name is in UTF-8, I want the same thing I put in to come out when stringifying it - but with the changes to sv.[ch] that came out of your other mail, such a macro would be redundant.

Nicholas Clark

Aug 3, 2011, 10:13:31 AM
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Zefram, Father Chrysostomos, Karl Williamson
On Thu, Jul 21, 2011 at 06:07:36AM -0300, Brian Fraser wrote:
> On Fri, Jul 15, 2011 at 6:43 PM, Nicholas Clark <ni...@ccl4.org> wrote:
>
> > On Mon, Jul 11, 2011 at 04:43:03AM -0300, Brian Fraser wrote:
> >
> > Would inline functions work?
> >
> > There is this "new fangled" PERL_STATIC_INLINE as of 5.14 which gives
> > C<static inline> where possible, and C<static> on legacy compilers.
> >
> >
> I don't think so; the current functions are A in embed.fnc, so they can't be
> static. :(

Agree, they can't be static in a core *.c file.

The intent of "static inline" is that they go into one of the header files,
which means that they are visible and available to everyone. The plan is to
use them in place of the core's current macro addiction.

It happens that *nearly* every example use so far isn't in a header file. :-)

Nicholas Clark
