GSOC Status Report, Week 10

6 views
Skip to first unread message

Brian Fraser

unread,
Aug 3, 2011, 1:15:36 AM8/3/11
to tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson
Howdy people.

Turns out cleaning up labels was a bit more work than I had anticipated; I had to add new versions of the C(op|x)LABEL() macros to get the length and the flags of the label, modify the PVOP struct and the do*label() functions to take a flags parameter, add _flags variants to newPVOP and newSTATEOP, and change LABEL's type in perly.y from p_tkval to opval, which meant modifying toke.c too.
I thought it was going to take adding a flags parameter somewhere. This was the complete opposite of attributes : (

But the real focus of the week was toke.c. First order of business was adding _(pvn|pv|sv) forms to yyerror, which allowed me to clean up a fairly big slice of error messages throughout the core - Most of op.c, for starters. A big chunk of last week was spent on getting error messages clean - As of right now, the biggest part left untouched is in the reg*.c files.
And, while cleaning error messages is generally uninteresting, two things did come up:
First, embedded nuls in heredoc delimiters are currently misreported; eval qq!<<"TEST\0HERE";\nline\n!; say $@; for example, gives nothing back in blead, and only TEST locally - I'll try to get that done, but it's not very high priority.
Second, any regex to match a valid identifier is a complete lie (Sorry, TomC!). More specifically, right now in blead, the first octet of an identifier has no restrictions! $\N{DIVISON SIGN} is currently a valid identifier; This is more or less reported in RT#69032. But, I kind of made that even worse; Since sanitizing that part of the tokenizer to respect character semantics, any _character_ following a sigil can be a valid identifier! So $\N{DROMEDARY CAMEL} is perfectly fine, as is \N{SNAKE} (However, fully qualifying any of those, even \N{DIVISON SIGN}, suddenly turns them into "unrecognized character" croaks).
So I'm (stealing the idea of) proposing adding a new exception, copied more or less verbatim from Moritz Lenz's rt ticket: "Character '%s' not allowed in identifier". I can only think of two places this could be called from, and only the former isn't contentious:
  • The end of S_scan_ident. If we reach here, and the variable d isn't null, it either contains a punctuation variable, or a character that didn't match /^\p{XIDS}/. If we are under UTF-8 semantics (so as not to suddenly break non use utf8; code), and !isASCII(*d) (since basically all of ASCII is reserved for special variables), and d !~ /^\p{XIDS}/ (which is somewhat redundant, admittedly), we croak with the error above, and show the character as \x{...}.
  • Somewhat more contentiously, in the big switch inside yylex, in the default case. Instead of croaking with "Unrecognized character", if PL_in_my is true (that is, we are inside an our/my/state declaration) and the previous character !isSPACE(), we croak with not allowed in identifier. So, for instance, 'my $x \N{DROMEDARY CAMEL}' would croak with 'unrecognized character', but 'my $x\N{DROMEDARY CAMEL}' with 'not allowed in identifier'. The admittedly quite awful lookbehind for a space is there mostly because, in a future with pluggable operators, my $x \N{ARMENIAN EXCLAMATION MARK}= 10; could very possibly be valid syntax, and the exclamation mark has nothing to do with the identifier.

Moving on, PL_multi_(close|open) are chars, not char*, so any attempt to implement RT#89032 would have to turn those to something more sensible (an SV, perhaps? through a simple three-element struct would make do just fine). Should I change it?
I also got \N{...} clean, but in looking for the relevant ticket (RT#73022), I noticed that Karl had already taken it last year. Teaches me to read the entire thread, I guess - Thankfully the fix was quite small (one line!). Karl, any reason this didn't make it to blead? Or did it just fall through the cracks?

I think that's about it for the interesting bits. Because a lot of what I did this week is severely undertested (and critically disorganized :P), I'll be putting some time on that the next couple of days, in addition to continuing with the toke.c shenanigans.

Also, a bit of a heads up, college started again for me - I don't think my schedule will change much from what it was lately though, minus for Wednesdays.

Reini Urban

unread,
Aug 3, 2011, 6:20:28 AM8/3/11
to tpf-gsoc...@googlegroups.com, Brian Fraser, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson
2011/8/3 Brian Fraser <fras...@gmail.com>:

> Turns out cleaning up labels was a bit more work than I had anticipated; I
> had to add new versions of the C(op|x)LABEL() macros to get the length and
> the flags of the label, modify the PVOP struct and the do*label() functions
> to take a flags parameter, add _flags variants to newPVOP and newSTATEOP,
> and change LABEL's type in perly.y from p_tkval to opval, which meant
> modifying toke.c too.
> I thought it was going to take adding a flags parameter somewhere. This was
> the complete opposite of attributes : (

I see the attributes work on github, but not CopLABEL's yet.

I'm also concerned about the public HEK API.
I see only functions accepting HEK's, but no functions getting those.
How should we create HEK's which need to be passed to the setters?

#if defined(PERL_IN_HV_C)
...
sa |HE* |new_he
sanR |HEK* |save_hek_flags |NN const char *str|I32 len|U32 hash|int flags

Shouldn't those be public now?
--
Reini

Nicholas Clark

unread,
Aug 3, 2011, 6:56:24 AM8/3/11
to Reini Urban, tpf-gsoc...@googlegroups.com, Brian Fraser, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson
On Wed, Aug 03, 2011 at 12:20:28PM +0200, Reini Urban wrote:

> I'm also concerned about the public HEK API.
> I see only functions accepting HEK's, but no functions getting those.
> How should we create HEK's which need to be passed to the setters?
>
> #if defined(PERL_IN_HV_C)
> ...
> sa |HE* |new_he
> sanR |HEK* |save_hek_flags |NN const char *str|I32 len|U32 hash|int flags
>
> Shouldn't those be public now?

There is this already:

Ap |HEK* |share_hek |NN const char* str|I32 len|U32 hash

which returns a HEK.

Nicholas Clark

Nicholas Clark

unread,
Aug 3, 2011, 8:14:15 AM8/3/11
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Dave Mitchell
On Wed, Aug 03, 2011 at 02:15:36AM -0300, Brian Fraser wrote:

> Moving on, PL_multi_(close|open) are chars, not char*, so any attempt to
> implement RT#89032 would have to turn those to something more sensible (an
> SV, perhaps? through a simple three-element struct would make do just fine).
> Should I change it?

I don't see any particular problem with changing them to something with a new
name. Google codesearch suggests that nothing is using them. (Sadly the most
excellent grep.cpan.me isn't making us remember how much we value it by not
being there).

What would be in the proposed three-element struct? It's not obvious to me.
I'd be tempted to avoid SVs where one doesn't need the full "power" of SVs.

The core has a long history of (re) using the core's data types
(such as SVs for temporary buffers, AVs for pad structures and aspects of
PerlIO, HVs for the regex Unicode features) which I'm not sure has always
been the best long-term trade off, given

a: The added flexibility of these containers means that they are actually
larger than that which would be minimally necessary
(eg 3 pointers in every AV that will never be needed for a PAD)
b: They end up trying to be one-size-fits all, which may mean they aren't
always the fastest way to solve a problem (HVs inversion lists)
c: global destruction and interpreter creation memory management mean that the
things you rather hoped would be there can disappear from under you, or not
yet be ready to use

Nicholas Clark

Dave Mitchell

unread,
Aug 3, 2011, 9:24:49 AM8/3/11
to Nicholas Clark, Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On Wed, Aug 03, 2011 at 01:14:15PM +0100, Nicholas Clark wrote:
> The core has a long history of (re) using the core's data types
> (such as SVs for temporary buffers, AVs for pad structures and aspects of
> PerlIO, HVs for the regex Unicode features) which I'm not sure has always
> been the best long-term trade off, given
>
> a: The added flexibility of these containers means that they are actually
> larger than that which would be minimally necessary
> (eg 3 pointers in every AV that will never be needed for a PAD)
> b: They end up trying to be one-size-fits all, which may mean they aren't
> always the fastest way to solve a problem (HVs inversion lists)
> c: global destruction and interpreter creation memory management mean that the
> things you rather hoped would be there can disappear from under you, or not
> yet be ready to use

I've often thought we need a lightweight array type that is just a

struct { int size; void *array[1]; /* chumminess */ }

for all those situations where there aren't multiple references, and we
don't mind the struct being realloced: such as the backref array.
The API could take a pointer to a pointer to the struct, so it can update
the original pointer if it needs to reallooc.

And we could probably make more use of the ptr_table API instead of using
a full hash (although I don't know whether the API would need updating to
make it more widely usable).

--
Diplomacy is telling someone to go to hell in such a way that they'll
look forward to the trip

Dave Mitchell

unread,
Aug 3, 2011, 9:33:06 AM8/3/11
to Nicholas Clark, Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On Wed, Aug 03, 2011 at 01:14:15PM +0100, Nicholas Clark wrote:
> On Wed, Aug 03, 2011 at 02:15:36AM -0300, Brian Fraser wrote:
>
> > Moving on, PL_multi_(close|open) are chars, not char*, so any attempt to
> > implement RT#89032 would have to turn those to something more sensible (an
> > SV, perhaps? through a simple three-element struct would make do just fine).
> > Should I change it?
>
> I don't see any particular problem with changing them to something with a new
> name. Google codesearch suggests that nothing is using them. (Sadly the most
> excellent grep.cpan.me isn't making us remember how much we value it by not
> being there).
>
> What would be in the proposed three-element struct? It's not obvious to me.
> I'd be tempted to avoid SVs where one doesn't need the full "power" of SVs.

Note also that PL_multi_(close|open) are just back-compat aliases to
fields within the new(ish) PL_parser struct:

#define PL_multi_open (PL_parser->multi_open)
#define PL_multi_close (PL_parser->multi_close)

this may affect how you want to declare the three fields: i.e. there's no
particular reason why you need to preserve the PL_multi_open/close names
if directly referring to structure or substructure elements within
PL_parser makes more sense.

--
Dave's first rule of Opera:
If something needs saying, say it: don't warble it.

Nicholas Clark

unread,
Aug 3, 2011, 9:41:34 AM8/3/11
to Dave Mitchell, Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On Wed, Aug 03, 2011 at 02:24:49PM +0100, Dave Mitchell wrote:

> I've often thought we need a lightweight array type that is just a
>
> struct { int size; void *array[1]; /* chumminess */ }
>
> for all those situations where there aren't multiple references, and we
> don't mind the struct being realloced: such as the backref array.
> The API could take a pointer to a pointer to the struct, so it can update
> the original pointer if it needs to reallooc.

I'd been wondering about

struct { STRLEN size; STRLEN allocated; } ... /* raw memory follows */

but I was thinking about various times SVs are used as byte buffers.

int troubles me because it's signed, and because it likely is only 32 bits
on a 64 bit system.

> And we could probably make more use of the ptr_table API instead of using
> a full hash (although I don't know whether the API would need updating to
> make it more widely usable).

From memory, it doesn't have a delete. It assumes that it grows, and is then
discarded.

Nicholas Clark

Brian Fraser

unread,
Aug 4, 2011, 3:53:23 AM8/4/11
to Nicholas Clark, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Dave Mitchell
On Wed, Aug 3, 2011 at 9:14 AM, Nicholas Clark <ni...@ccl4.org> wrote:
I don't see any particular problem with changing them to something with a new
name. Google codesearch suggests that nothing is using them. (Sadly the most
excellent grep.cpan.me isn't making us remember how much we value it by not
being there).

What would be in the proposed three-element struct? It's not obvious to me.
I'd be tempted to avoid SVs where one doesn't need the full "power" of SVs.


Truth be told, it doesn't really need three. Or even a struct. It could just be a char* or char[4] and work more or less fine.
But with that disclaimer out of the way, what I had in mind was simply replacing those chars with something like this:
struct {                   
    char[UTF8_MAXBYTES] delim;
    STRLEN len;
    U32 flags;
} multi_open, multi_close;

And thanks to Dave pointing out that PL_multi_(etc) are just backcompat macros, I imagine they could be redefined as (*(PL_parser->multi_etc.delim)) for the hypothetical [DC]PAN module using them.

The flags are superfluous, really. I think I was worried about \xAB and \xBB being (up|down)gradeable last night, but I can't come up with an example that isn't incredibly silly - or possible, for that matter. So make that a two-element struct.
The length member is nice to have, because it allows us to optimize for the most common case (when len == 1, we can keep the current char-oriented behavior) and stop strlens popping out here and there for the not-so-common-or-yet-possible case.

For the char[ UTF8_MAXBYTES ], I _think_ that char[3] might actually make do just fine there [*], assuming that the len member stays and all the possible paired delimiters are all on plane 0 (The latter does hold true for anything that matches /[\p{Open_Punctuation}\p{Close_Punctuation}\p{Initial_Punctuation}\p{Final_Punctuation}]/).
But that might be completely backwards, in which case please smack some sense into me. All this optimizing talk is a quite over my head for now, so you'll have to excuse my naiveté.

[*] This does stop Ms. Future Extension Writer from having qq\N{DROMEDARY CAMEL} eats the \N{SNAKE}; work by using the parser struct alone, but so does the current implementation of S_scan_str(), so nothing of value (is currently) lost, I guess?

Nicholas Clark

unread,
Aug 4, 2011, 5:24:02 AM8/4/11
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Dave Mitchell
On Thu, Aug 04, 2011 at 04:53:23AM -0300, Brian Fraser wrote:
> On Wed, Aug 3, 2011 at 9:14 AM, Nicholas Clark <ni...@ccl4.org> wrote:
>
> > I don't see any particular problem with changing them to something with a
> > new
> > name. Google codesearch suggests that nothing is using them. (Sadly the
> > most
> > excellent grep.cpan.me isn't making us remember how much we value it by
> > not
> > being there).
> >
> > What would be in the proposed three-element struct? It's not obvious to me.
> > I'd be tempted to avoid SVs where one doesn't need the full "power" of SVs.
> >
> >
> Truth be told, it doesn't really need three. Or even a struct. It could just
> be a char* or char[4] and work more or less fine.
> But with that disclaimer out of the way, what I had in mind was simply
> replacing those chars with something like this:
> struct {
> char[UTF8_MAXBYTES] delim;
> STRLEN len;
> U32 flags;
> } multi_open, multi_close;
>
> And thanks to Dave pointing out that PL_multi_(etc) are just backcompat
> macros, I imagine they could be redefined as (*(PL_parser->multi_etc.delim))
> for the hypothetical [DC]PAN module using them.

I'd feel more comfortable removing the names, and ensuring that compile time
breakage happened for anything hypothetical, to highlight that it needs
fixing, rather than trying to bodge something that increasingly doesn't work
as well.


> The flags are superfluous, really. I think I was worried about \xAB and \xBB
> being (up|down)gradeable last night, but I can't come up with an example
> that isn't incredibly silly - or possible, for that matter. So make that a
> two-element struct.
> The length member is nice to have, because it allows us to optimize for the
> most common case (when len == 1, we can keep the current char-oriented
> behavior) and stop strlens popping out here and there for the
> not-so-common-or-yet-possible case.

I don't think strlen() wants to appear anywhere near this code, as

a: the NUL byte is a valid delimiter.
b: given that we're talking about the (well formed) UTF-8 representation of
one Unicode code point, the length is implicit in the first octet.

Which means that the "fast path" is when IS_UTF8_CHAR_1(delim) is true.

#define IS_UTF8_CHAR_1(p) \
((p)[0] <= 0x7F)

> For the char[ UTF8_MAXBYTES ], I _think_ that char[3] might actually make do
> just fine there [*], assuming that the len member stays and all the possible
> paired delimiters are all on plane 0 (The latter does hold true for anything
> that matches

I'm not an expert on this stuff, but I think that open and close are used at
all times, not just for paired delimiters:

/* if open delimiter is the close delimiter read unbridle */
if (PL_multi_open == PL_multi_close) {

So I think it needs to stay as your original suggestion of UTF8_MAXBYTES.

Nicholas Clark

Brian Fraser

unread,
Aug 7, 2011, 1:18:19 AM8/7/11
to Reini Urban, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram, Karl Williamson
On Wed, Aug 3, 2011 at 7:20 AM, Reini Urban <rur...@x-ray.at> wrote:

I see the attributes work on github, but not CopLABEL's yet.


Hm, I wanted to organize things a bit before pushing; Unfortunately in organizing I found a bug in the GV stuff (more on that later), so I never got around doing it; Apologies.
https://github.com/Hugmeir/hugmeir_gsoc/tree/tokemess should have most of the progress up to this report, though some tests fail due to the aforementioned GV bug.

About the bug:
First, how to trigger it (in the GV repo, not in blead):
./perl -Ilib -CS -E 'eval "our \$main::\x{30cb}"; say "[$@]";'

Doing that will leave $@ with a "Compilation error at lib/utf8_heavy.pl line 464" message, instead of the expected result. The compilation error itself comes from utf8_heavy.pl attempting a do FILE; and failing - I'm not sure why that's failing, but in any case it gets there because of some calls to isIDFIRST(_lazy)() in gv.c.

I can't figure out why the do FILE is failing, but a temporary "don't merge into blead until this is sorted out" fix would be to replace those isIDFIRST calls with an is_utf8_alnum() call, since unicore/lib/Perl/Word.pl is already loaded by the time we reach gv.c. I imagine the difference will lead to some subtle bugs, though.

Karl Williamson

unread,
Aug 10, 2011, 2:02:33 PM8/10/11
to Brian Fraser, tpf-gsoc...@googlegroups.com, Perl5 Porters Mailing List, Florian Ragwitz, Father Chrysostomos, Zefram
On 08/02/2011 11:15 PM, Brian Fraser wrote:
> I also got \N{...} clean, but in looking for the relevant ticket
> (RT#73022), I noticed that Karl had already taken it last year. Teaches
> me to read the entire thread, I guess - Thankfully the fix was quite
> small (one line!). Karl, any reason this didn't make it to blead? Or did
> it just fall through the cracks?

I think it fell through the cracks.

Reply all
Reply to author
Forward
0 new messages