Re: New version of PGE released

Nicholas Clark

May 3, 2005, 4:22:11 PM

to Patrick R. Michaud, perl6-i...@perl.org, perl6-c...@perl.org

On Tue, May 03, 2005 at 02:33:25PM -0500, Patrick R. Michaud wrote:

[snip the good bit]

Three cheers for Patrick.
Boo hiss to real life, especially when it gets in the way.

> Not yet implemented, but coming soon (rough priority order):
>
> - updated test harness/test suite
> - cut operations don't always work properly
> - subrules
> - character classes
> - interpolated variables
> - conjunctive matches
> - capture aliases
> - many, many potential optimizations

> Tests, patches, comments, questions, etc. welcomed. Questions/comments
> about PGE installation and parrot issues probably belong on
> perl6-internals, modification and questions about PGE execution
> and internals probably go on perl6-compiler. Unless I hear otherwise
> I will probably announce minor changes only to perl6-compiler.

Whilst I confess that it's unlikely to be me here, if anyone has the time
to contribute some help, do you have a list of useful self-contained tasks
that people might be able to take on?

Nicholas Clark

Dan Sugalski

May 4, 2005, 12:30:48 PM

to Patrick R. Michaud, perl6-i...@perl.org

At 10:21 AM -0500 5/4/05, Patrick R. Michaud wrote:

>On Tue, May 03, 2005 at 09:22:11PM +0100, Nicholas Clark wrote:
>>
>> Whilst I confess that it's unlikely to be me here, if anyone has the time
>> to contribute some help, do you have a list of useful self-contained tasks
>> that people might be able to take on?
>

>Actually, overnight I realized there's a relatively good-sized
>project that needs figuring out -- identifying character properties
>such as isalpha, islower, isprint, etc. Here I'll briefly sketch
>how I'd like it to work, and maybe someone enterprising can take
>things from

I'd planned on everything else going into constructed character
classes. I'd figured the named classes would correspond to the major
regex classes (things represented by \X sequences) while the
constructed classes would handle everything else and more or less
correspond to [] style sequences.

I thought I'd put in some docs to that effect, but apparently not. :(
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk

Patrick R. Michaud

May 4, 2005, 11:21:27 AM

to perl6-i...@perl.org

On Tue, May 03, 2005 at 09:22:11PM +0100, Nicholas Clark wrote:
>

> Whilst I confess that it's unlikely to be me here, if anyone has the time
> to contribute some help, do you have a list of useful self-contained tasks
> that people might be able to take on?

Actually, overnight I realized there's a relatively good-sized
project that needs figuring out -- identifying character properties
such as isalpha, islower, isprint, etc. Here I'll briefly sketch
how I'd like it to work, and maybe someone enterprising can take
things from there for us.

Currently Parrot offers quite a few ops for character properties --
namely "is_whitespace", "is_wordchar", "is_digit", etc. and their
"find_XXX" counterparts. While these are useful, the set is also
incomplete -- at the moment I haven't found anything that lets
us find alphabetic, uppercase, lowercase, etc. properties. (If I've
just overlooked something, please point it out!)

I suppose Parrot could add a bunch of new "is_alpha", "is_upper",
"is_lower", etc. ops, but having separate opcodes for every
property actually complicates the design of PGE a fair bit
and also creates a lot of very function-specific opcodes.
What would *really* be useful would be to have three basic opcodes:

is_cclass(out INT, in INT, in STR, in INT)
    Set $1 to 1 if the codepoint of $3 at position $4 is in
    the character class(es) given by $2.

find_cclass(out INT, in INT, in STR, in INT, in INT)
    Set $1 to the offset of the first codepoint matching
    the character class(es) given by $2 in string $3, starting
    at offset $4 for up to $5 codepoints. If no matching
    character is found, set $1 to -1.

find_not_cclass(out INT, in INT, in STR, in INT, in INT)
    Set $1 to the offset of the first codepoint not matching
    the character class(es) given by $2 in string $3, starting
    at offset $4 for up to $5 codepoints. If the substring
    consists entirely of matching characters, set $1 to -1.

The character classes in $2 above are given by an integer bitmask,
defined according to the following table (or something like it --
I took this table from ctype.h on my system, then added a "newline"
class):

0x0001 - uppercase char
0x0002 - lowercase char
0x0004 - alphabetic char
0x0008 - numeric character
0x0010 - hexadecimal digit
0x0020 - whitespace
0x0040 - printing
0x0080 - graphical
0x0100 - blank (i.e., SPC and TAB)
0x0200 - control character
0x0400 - punctuation character
0x0800 - alphanumeric character
0x1000 - newline character

We have 32 bits available, so we could extend this table as needed.
And EVENTUALLY we'll probably need a more general interface
to handle Unicode properties as well as character class compositions,
but I speculate that we can do those either in a library, or
(if speed is needed) we can build a "character class" PMC type
optimized for charsets and have:

is_cclass(out INT, in PMC, in STR, in INT)
find_cclass(out INT, in PMC, in STR, in INT, in INT)
find_not_cclass(out INT, in PMC, in STR, in INT, in INT)

But for now the integer representation of character classes
ought to be sufficient.
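
To make the proposed semantics concrete, here is a minimal C sketch of the three integer-bitmask operations over a plain byte string. It is only an illustration, not Parrot code: the CCLASS_* names, the explicit length argument, and the classify() helper built on ctype.h are all assumptions, as is the reading that a mask with several bits set matches a character belonging to any of the named classes. Treating both LF and CR as "newline" is likewise a guess.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flag names; the values follow the table in the proposal. */
enum cclass {
    CCLASS_UPPER   = 0x0001, CCLASS_LOWER   = 0x0002,
    CCLASS_ALPHA   = 0x0004, CCLASS_NUMERIC = 0x0008,
    CCLASS_XDIGIT  = 0x0010, CCLASS_SPACE   = 0x0020,
    CCLASS_PRINT   = 0x0040, CCLASS_GRAPH   = 0x0080,
    CCLASS_BLANK   = 0x0100, CCLASS_CNTRL   = 0x0200,
    CCLASS_PUNCT   = 0x0400, CCLASS_ALNUM   = 0x0800,
    CCLASS_NEWLINE = 0x1000
};

/* Return every class bit that applies to one byte (default C locale). */
static unsigned classify(unsigned char c)
{
    unsigned bits = 0;
    if (isupper(c))  bits |= CCLASS_UPPER;
    if (islower(c))  bits |= CCLASS_LOWER;
    if (isalpha(c))  bits |= CCLASS_ALPHA;
    if (isdigit(c))  bits |= CCLASS_NUMERIC;
    if (isxdigit(c)) bits |= CCLASS_XDIGIT;
    if (isspace(c))  bits |= CCLASS_SPACE;
    if (isprint(c))  bits |= CCLASS_PRINT;
    if (isgraph(c))  bits |= CCLASS_GRAPH;
    if (c == ' ' || c == '\t')  bits |= CCLASS_BLANK;
    if (iscntrl(c))  bits |= CCLASS_CNTRL;
    if (ispunct(c))  bits |= CCLASS_PUNCT;
    if (isalnum(c))  bits |= CCLASS_ALNUM;
    if (c == '\n' || c == '\r') bits |= CCLASS_NEWLINE; /* assumption */
    return bits;
}

/* 1 if the character at pos belongs to any class in flags, else 0. */
int is_cclass(unsigned flags, const char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    if (pos >= len) return 0;
    return (classify((unsigned char)s[pos]) & flags) != 0;
}

/* Offset of the first matching character in [start, start+count), or -1. */
long find_cclass(unsigned flags, const char *s, size_t len,
                 size_t start, size_t count)
{
    size_t end, i;
    assert(s != NULL);
    if (start >= len) return -1;
    end = (count > len - start) ? len : start + count;
    for (i = start; i < end; i++)
        if (classify((unsigned char)s[i]) & flags) return (long)i;
    return -1;
}

/* Offset of the first non-matching character in the same window, or -1. */
long find_not_cclass(unsigned flags, const char *s, size_t len,
                     size_t start, size_t count)
{
    size_t end, i;
    assert(s != NULL);
    if (start >= len) return -1;
    end = (count > len - start) ? len : start + count;
    for (i = start; i < end; i++)
        if (!(classify((unsigned char)s[i]) & flags)) return (long)i;
    return -1;
}
```

With this shape, one pair of find functions covers every class and its complement, which is the point of replacing the per-property opcodes.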

Anyway, that's another very useful self-contained task that
I'd be glad to have a volunteer for.

Pm

Patrick R. Michaud

May 4, 2005, 1:03:44 PM

to Dan Sugalski, perl6-i...@perl.org

On Wed, May 04, 2005 at 12:30:48PM -0400, Dan Sugalski wrote:
> At 10:21 AM -0500 5/4/05, Patrick R. Michaud wrote:
> >Actually, overnight I realized there's a relatively good-sized
> >project that needs figuring out -- identifying character properties
> >such as isalpha, islower, isprint, etc. Here I'll briefly sketch
> >how I'd like it to work, and maybe someone enterprising can take
> >things from
>
> I'd planned on everything else going into constructed character
> classes. I'd figured the named classes would correspond to the major
> regex classes (things represented by \X sequences) while the
> constructed classes would handle everything else and more or less
> correspond to [] style sequences.

Makes sense. But somehow the named class versions of the ops
don't give me quite as much coverage as I'd like -- for example,
I can use "find_digit" to measure off a sequence of non-digit
characters (e.g., rx { \D* } ), but there's not a corresponding
"find_non_digit" opcode to let me measure off a set of digits
(e.g., rx { \d* } ).
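
The gap is easy to see in a small C sketch. Measuring a run of digits for \d* needs the offset of the first non-digit, which is exactly what a "find_non_digit" operation would return. The function below is a hypothetical stand-in for that missing opcode, written against a plain byte string with an explicit length; neither the signature nor match_digit_star() comes from Parrot.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the missing opcode: offset of the first
 * non-digit at or after start, or -1 if only digits remain. */
long find_non_digit(const char *s, size_t len, size_t start)
{
    size_t i;
    assert(s != NULL);
    for (i = start; i < len; i++)
        if (!isdigit((unsigned char)s[i])) return (long)i;
    return -1;
}

/* Number of characters that \d* would consume beginning at start. */
size_t match_digit_star(const char *s, size_t len, size_t start)
{
    long stop;
    if (start >= len) return 0;
    stop = find_non_digit(s, len, start);
    /* -1 means the digits run to the end of the string. */
    return (stop < 0) ? len - start : (size_t)stop - start;
}
```

For example, match_digit_star("2005-05", 7, 0) stops at the hyphen and reports a run of four digits.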

We'll still need a way to make constructed character classes
for <upper>, <lower>, and the like. But I (or someone else) can
probably build that component in PIR for now, just hardcoding the ASCII or
Latin-1 tables for the time being until we come up with something
else later.

Pm

Leopold Toetsch

May 4, 2005, 3:05:13 PM

to Patrick R. Michaud, perl6-i...@perl.org

Patrick R. Michaud wrote:

[ see below for some more ]

For hysterical raisins we actually already have two char class
interfaces (partially) implemented, e.g.

src/string.c:

Parrot_string_is_digit(Interp *interpreter, STRING *s, INTVAL offset)

src/string_primitives.c

Parrot_char_is_digit(Interp *interpreter, UINTVAL character)

The former is covered by an opcode in ops/string.ops and is the more
useful form, taking a string and an offset. The latter OTOH can call the
ICU function, if ICU is present.

To clean up that mess, we stick to Patrick's plan, which implies, in no
particular order:

- implement the new opcodes, first in experimental.ops
- create an enum of the char classes in charset.h
- create the general API in that header too
- convert existing charset classifying tables to the new bits
- move the ICU functions to charset/unicode.c
- deprecate existing opcodes and APIs
- clean up string_primitives.*
- convert existing tests
- write new tests
- write more new tests
- all I've forgotten to list

See also: src/             string.c string_primitives.c
          include/parrot/  charset.h string_primitives.h string_funcs.h
          charset/         *.c *.h [1]
          ops/             string.ops
          t/               op/string_cs.t

[1] especially char typetable[] and usage of it
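
One possible shape for the enum, the general API, and a converted type table is sketched below in C. None of this is taken from the Parrot headers: the CC_* names, the charset struct, and the ascii_* functions are hypothetical, and only four of the class bits are filled in to keep it short. The idea is that each charset supplies its own classifier behind one entry point, so an ICU-backed Unicode charset could sit next to a table-driven ASCII one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the class bits, using the proposed values. */
enum {
    CC_UPPER   = 0x0001,
    CC_LOWER   = 0x0002,
    CC_ALPHA   = 0x0004,
    CC_NUMERIC = 0x0008
};

/* Each charset provides its own classifier behind a common interface. */
typedef struct charset {
    const char *name;
    int (*is_cclass)(unsigned flags, unsigned long codepoint);
} charset;

/* Type table in the new bit layout, filled in lazily. */
static unsigned short ascii_typetable[128];
static int ascii_ready = 0;

static void ascii_build_table(void)
{
    unsigned c;
    /* Assumes an ASCII execution character set (contiguous letters). */
    for (c = 'A'; c <= 'Z'; c++) ascii_typetable[c] = CC_UPPER | CC_ALPHA;
    for (c = 'a'; c <= 'z'; c++) ascii_typetable[c] = CC_LOWER | CC_ALPHA;
    for (c = '0'; c <= '9'; c++) ascii_typetable[c] = CC_NUMERIC;
    ascii_ready = 1;
}

static int ascii_is_cclass(unsigned flags, unsigned long codepoint)
{
    if (!ascii_ready) ascii_build_table();
    if (codepoint >= 128) return 0;   /* outside ASCII: no classes */
    return (ascii_typetable[codepoint] & flags) != 0;
}

const charset ascii_charset = { "ascii", ascii_is_cclass };

/* General entry point: dispatch to whichever charset the string uses. */
int charset_is_cclass(const charset *cs, unsigned flags,
                      unsigned long codepoint)
{
    assert(cs != NULL && cs->is_cclass != NULL);
    return cs->is_cclass(flags, codepoint);
}
```

Because a table entry holds all of a character's bits at once, a test against a combined mask such as CC_UPPER | CC_NUMERIC is still a single lookup and a single AND.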

> Anyway, that's another very useful self-contained task that
> I'd be glad to have a volunteer for.

Yep.

> Pm

leo