possible change to is_digit, is_wordchar, etc

Patrick R. Michaud

May 6, 2005, 10:12:57 PM
to perl6-i...@perl.org
I'd like to make a slight change to the is_digit, is_wordchar,
and other is_* ops. Currently calling these ops at the offset
following the last codepoint results in a
"get_byte past the end of the buffer (1 of 1)" error;
it would be nicer if they simply returned false (0) at
this one position. (Going any further than that could generate
the error message.) This would also be consistent with several
of the other string ops that don't return errors just because
the offset is at the end of the string.
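The proposed boundary behaviour can be modelled outside Parrot. The following is a minimal Python sketch, not Parrot's implementation: the function names mirror the op under discussion, `IndexError` stands in for Parrot's "get_byte past the end of the buffer" error, and rejecting negative offsets is an assumption the message does not address.

```python
def is_digit(s: str, offset: int) -> bool:
    """Model of the proposed is_digit boundary behaviour (not Parrot's code)."""
    if offset < 0 or offset > len(s):
        # Anything past the one-past-the-end position is still an error.
        # Treating negative offsets as errors is an assumption of this sketch.
        raise IndexError(f"offset {offset} out of range for length {len(s)}")
    if offset == len(s):
        # Exactly one past the last codepoint: quietly report "not a digit".
        return False
    return s[offset].isdigit()


def count_digits(s: str, pos: int) -> int:
    """Run along a string of digits with no separate end-of-string check."""
    start = pos
    while is_digit(s, pos):
        pos += 1
    return pos - start
```

With this behaviour the scanning loop terminates on its own when the digits run up to the end of the string, which is the extra check the message wants to drop.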

As it is now, I'm having to put in extra checks to watch for
the end of the string when I'm just wanting to run along a
string of digits. Of course, if we had an op to find the
offset of the first non-digit/non-word/non-whitespace/etc.
codepoint then I could use that. :-)

I'll be glad to write the patch for is_digit and friends
if it's appropriate. Otherwise I'll just continue to work
around the current behavior with the extra checks for end of string.

Pm

Leopold Toetsch

May 7, 2005, 6:06:38 AM
to Patrick R. Michaud, perl6-i...@perl.org
Patrick R. Michaud wrote:
> I'd like to make a slight change to the is_digit, is_wordchar,
> and other is_* ops. Currently calling these ops at the offset
> following the last codepoint results in a
> "get_byte past the end of the buffer (1 of 1)" error;
> it would be nicer if they simply returned false (0) at
> this one position. (Going any further than that could generate
> the error message.)

Sounds reasonable. We can change that along with the proposed
C<is_cclass> implementation.

> As it is now, I'm having to put in extra checks to watch for
> the end of the string when I'm just wanting to run along a
> string of digits. Of course, if we had an op to find the
> offset of the first non-digit/non-word/non-whitespace/etc.
> codepoint then I could use that. :-)

Well, then working on the general find and find_not classifying opcodes
and implementation seems to be more effective than ...

> I'll be glad to write the patch for is_digit and friends
> if it's appropriate.

... repairing doomed code :-)

> Pm

leo

Jens Rieks

May 7, 2005, 6:09:57 AM
to perl6-i...@perl.org, Patrick R. Michaud, Leopold Toetsch
On Saturday 07 May 2005 12:06, Leopold Toetsch wrote:
> Patrick R. Michaud wrote:
> > I'd like to make a slight change to the is_digit, is_wordchar,
> > and other is_* ops. Currently calling these ops at the offset
> > following the last codepoint results in a
> > "get_byte past the end of the buffer (1 of 1)" error;
> > it would be nicer if they simply returned false (0) at
> > this one position. (Going any further than that could generate
> > the error message.)
>
> Sounds reasonable. We can change that along with the proposed
> C<is_cclass> implementation.
Okay, I'll implement it.

> > As it is now, I'm having to put in extra checks to watch for
> > the end of the string when I'm just wanting to run along a
> > string of digits. Of course, if we had an op to find the
> > offset of the first non-digit/non-word/non-whitespace/etc.
> > codepoint then I could use that. :-)
>
> Well, then working on the general find and find_not classifying opcodes
> and implementation seems to be more effective than ...
>
> > I'll be glad to write the patch for is_digit and friends
> > if it's appropriate.
>
> ... repairing doomed code :-)
>
> > Pm
>
> leo

jens

Jens Rieks

May 8, 2005, 5:45:08 PM
to perl6-i...@perl.org, Leopold Toetsch, Patrick R. Michaud
Hi,

C<is_cclass> is now implemented. The second part (find_(not_)cclass) will
follow tomorrow.

jens

PS: The "get_byte past the end of the buffer (1 of 1)" error is hopefully
fixed, too. Don't know if it is the cleanest way, but it at least works :-)

Patrick R. Michaud

May 8, 2005, 7:11:36 PM
to Jens Rieks, perl6-i...@perl.org, Leopold Toetsch
On Sun, May 08, 2005 at 11:45:08PM +0200, Jens Rieks wrote:
> Hi,
>
> C<is_cclass> is now implemented. The second part (find_(not_)cclass) will
> follow tomorrow.

Wow, yippee -- that is fast! There were several times yesterday
when I really wished for find_not_cclass, so this will really
clean up (and speed up) the PGE implementation. Thanks!

Pm

Patrick R. Michaud

May 9, 2005, 1:53:32 AM
to Jens Rieks, perl6-i...@perl.org, Leopold Toetsch
On Sun, May 08, 2005 at 11:45:08PM +0200, Jens Rieks wrote:
> Hi,
>
> C<is_cclass> is now implemented. The second part (find_(not_)cclass) will
> follow tomorrow.

The attached patch file adjusts C<is_cclass> to always return false
for offsets beyond the end of the string, and updates
t/op/string_cclass.t to test this.
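The semantics described for the patch (false for any offset past the end, with no error) can be sketched as follows. This is an illustrative Python model, not the patch itself: the `CClass` flag values and the predicate table are hypothetical stand-ins for Parrot's `.CCLASS_*` constants, and returning false for negative offsets is an assumption.

```python
from enum import IntFlag


class CClass(IntFlag):
    # Hypothetical flag values; Parrot's real .CCLASS_* constants may differ.
    NUMERIC = 1
    WORD = 2
    WHITESPACE = 4
    NEWLINE = 8


# One predicate per class flag (an approximation of each class).
_PREDICATES = {
    CClass.NUMERIC: str.isdigit,
    CClass.WORD: lambda c: c.isalnum() or c == "_",
    CClass.WHITESPACE: str.isspace,
    CClass.NEWLINE: lambda c: c in "\r\n",
}


def is_cclass(flags: CClass, s: str, offset: int) -> bool:
    """True if the character at offset belongs to any class in flags."""
    if offset < 0 or offset >= len(s):
        # At or beyond the end of the string: always false, never an error.
        return False
    ch = s[offset]
    return any(pred(ch) for flag, pred in _PREDICATES.items() if flag & flags)
```

Because classes are bit flags in this model, a caller can test several at once, for example `is_cclass(CClass.NUMERIC | CClass.WHITESPACE, s, pos)`.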

Pm

cclass.patch

Leopold Toetsch

May 9, 2005, 3:49:38 AM
to Patrick R. Michaud, Jens Rieks, perl6-i...@perl.org
Patrick R. Michaud wrote:
> On Sun, May 08, 2005 at 11:45:08PM +0200, Jens Rieks wrote:
>
>>Hi,
>>
>>C<is_cclass> is now implemented. The second part (find_(not_)cclass) will
>>follow tomorrow.

Wow, that's fast

> The attached patch file adjusts C<is_cclass> to always return false
> for offsets beyond the end of the string, and updates
> t/op/string_cclass.t to test this.

I'll leave the patch to Jens.

As a side note:
If ENCODING_GET_CODEPOINT (and its equivalent get_byte) returns 0x0 at
the very end of the string (and throws an exception beyond it), would
this catch the desired end-of-string semantics for character classes?

leo

Jens Rieks

May 10, 2005, 11:08:49 AM
to perl6-i...@perl.org, Patrick R. Michaud
On Monday 09 May 2005 07:53, Patrick R. Michaud wrote:
> The attached patch file adjusts C<is_cclass> to always return false
> for offsets beyond the end of the string, and updates
> t/op/string_cclass.t to test this.
Thanks, applied!

find_cclass and find_not_cclass are in now.

jens

Jens Rieks

May 10, 2005, 4:22:35 PM
to perl6-i...@perl.org, Patrick R. Michaud
On Tuesday 10 May 2005 20:29, Patrick R. Michaud wrote:
> This is *excellent*.
>
> However, now that I look at things, I'm wondering if a slight
> change to the specification would be in order -- in my original
> post I said that find_cclass and find_not_cclass would return -1
> whenever they (didn't find | found) the character of interest --
> perhaps it would be better for them to return the length of
> the string instead?
>
> This might make it easier to do things like:
>
>     .local int pos
>     .local string token
>
>     $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
>     $I0 -= pos
>     token = substr $S0, pos, $I0
>
> whereas if we return -1 on "not found", we have to do some funny
> checking for it (rather than getting a nice null string).
> Note that we can still easily check for the existence of the char...
>
>     .local int pos
>     .local string token
>
>     $I2 = length $S0
>     $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
>     if $I0 == $I2 goto end_of_string_reached
>
> If you're in agreement, I'll create/submit a patch and update
> the test files. Best to decide this now before too many people
> start using it. :-)
Yes, sounds reasonable.

> Pm
jens

Patrick R. Michaud

May 10, 2005, 2:29:32 PM
to Jens Rieks, perl6-i...@perl.org
On Tue, May 10, 2005 at 05:08:49PM +0200, Jens Rieks wrote:

This is *excellent*.

However, now that I look at things, I'm wondering if a slight
change to the specification would be in order -- in my original
post I said that find_cclass and find_not_cclass would return -1
whenever they (didn't find | found) the character of interest --
perhaps it would be better for them to return the length of
the string instead?

This might make it easier to do things like:

    .local int pos
    .local string token

    $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
    $I0 -= pos
    token = substr $S0, pos, $I0

whereas if we return -1 on "not found", we have to do some funny
checking for it (rather than getting a nice null string).
Note that we can still easily check for the existence of the char...

    .local int pos
    .local string token

    $I2 = length $S0
    $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
    if $I0 == $I2 goto end_of_string_reached

If you're in agreement, I'll create/submit a patch and update
the test files. Best to decide this now before too many people
start using it. :-)
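The token-extraction idea can be modelled in a few lines of Python. This is a sketch under stated assumptions, not Parrot code: `find_not_word` is a hypothetical stand-in for `find_not_cclass .CCLASS_WORD` with the proposed "return the string length" convention, and "word character" is approximated as alphanumeric or underscore.

```python
def find_not_word(s: str, pos: int) -> int:
    """Offset of the first non-word character at or after pos, else len(s)."""
    while pos < len(s) and (s[pos].isalnum() or s[pos] == "_"):
        pos += 1
    return pos


def next_token(s: str, pos: int) -> tuple[str, int]:
    """Return the word starting at pos and the offset just past it."""
    end = find_not_word(s, pos)
    # end - pos is the token length; it is 0 (empty token) when pos is
    # already at a non-word character or at the end of the string.
    return s[pos:end], end
```

Because the "not found" result is simply the string length, the subtraction always yields a valid token length and an exhausted input produces an empty token rather than needing a special case.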

Pm

Leopold Toetsch

May 11, 2005, 3:58:37 AM
to Jens Rieks, perl6-i...@perl.org, Patrick R. Michaud
Jens Rieks wrote:

> IMO, we should deprecate the old find_* ops.
> It's a lot of (more or less) duplicate code, and not easy to maintain.

Yep, as well as the old is_foo opcodes and interfaces.

> jens

leo

Jens Rieks

May 11, 2005, 3:49:47 AM
to perl6-i...@perl.org, Patrick R. Michaud
On Wednesday 11 May 2005 04:30, Patrick R. Michaud wrote:
> ...well, in looking at it some more it's reasonable until I see
> that returning -1 is the way the other find_* ops work. So,
> part of me thinks we should either be consistent with those, or
> make the others consistent with the interpretation I gave above, or
> rename the find_cclass and find_not_cclass ops to something different
> (perhaps "span_cclass" and "span_not_cclass") so as to avoid
> confusion, or deprecate the pre-existing find_* ops.

IMO, we should deprecate the old find_* ops.
It's a lot of (more or less) duplicate code, and not easy to maintain.

jens

Patrick R. Michaud

May 10, 2005, 10:30:11 PM
to Jens Rieks, perl6-i...@perl.org
On Tue, May 10, 2005 at 10:22:35PM +0200, Jens Rieks wrote:
> On Tuesday 10 May 2005 20:29, Patrick R. Michaud wrote:
> > This is *excellent*.
> >
> > However, now that I look at things, I'm wondering if a slight
> > change to the specification would be in order -- in my original
> > post I said that find_cclass and find_not_cclass would return -1
> > whenever they (didn't find | found) the character of interest --
> > perhaps it would be better for them to return the length of
> > the string instead?
>
> Yes, sounds reasonable.

...well, in looking at it some more it's reasonable until I see
that returning -1 is the way the other find_* ops work. So,
part of me thinks we should either be consistent with those, or
make the others consistent with the interpretation I gave above, or
rename the find_cclass and find_not_cclass ops to something different
(perhaps "span_cclass" and "span_not_cclass") so as to avoid
confusion, or deprecate the pre-existing find_* ops.

Any suggestions from the peanut gallery about how this should
work? If I were designing for the long run I'd make the change,
but that's really not my decision to make, and I'll happily live
with any decision made as long as there's some sort of character
class support (and will likely generate+submit a patch for it).
I only bring this up because I'm asking myself what we'll want in
the long run, as opposed to what we might have now.

Some notes about the motivation for the change below...

Pm

-----

Why find_not_cclass is useful, and why it's useful to return the
length of string instead of an error (-1) indicator:

As many of you know, I've written the Perl Grammar Engine, and I'm
also working on another parser. In parsing expressions there are
many times where I want to be able to skip over a sequence of
characters until I find one that is not in a certain class --
i.e., for grabbing sequences of digits, word characters, skipping
whitespace, etc.

For example, before having find_cclass and find_not_cclass, grabbing
a sequence of digits in Parrot required a loop, as in:

    .local string target           # string to scan
    .local int pos                 # current scanning position
    $I0 = pos                      # keep start of position
  digits_loop:
    $I1 = is_digit target, pos
    unless $I1 goto digits_end
    inc pos
    goto digits_loop
  digits_end:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1  # extract sequence of digits

Having find_not_cclass makes this much simpler:

    .local string target           # string to scan
    .local int pos                 # current scanning position
    $I0 = pos                      # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1  # extract sequence of digits

Similarly, I often want to skip comments (e.g., '#' until the next
newline):

    .local string target           # string to scan
    .local int pos                 # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
  comment_skipped:
    # ...

Unfortunately, if find_cclass and find_not_cclass return -1 when the
desired character isn't available, I have to check for this special
return value and handle it accordingly:

    .local string target           # string to scan
    .local int pos                 # current scanning position
    $I0 = pos                      # keep start of position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    unless pos == -1 goto digit_1
    pos = length target
  digit_1:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1  # extract sequence of digits

and

    .local string target           # string to scan
    .local int pos                 # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
    unless pos == -1 goto comment_skipped
    pos = length target
  comment_skipped:
    # ...

So, in these particular instances, receiving the length of the
string is much more useful than the -1 value. Of course, the -1
value is a quick indicator that the desired character class
wasn't found; without it we have to check the return value
against the string length if we want to know if the character
position wasn't located. For example, instead of:

    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == -1 goto only_digits_left

with find_not_cclass returning the length of the string it becomes:

    $I2 = length $S0
    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == $I2 goto only_digits_left

In the things I'm writing thus far, I'm having to keep the
length of the string in a register anyway to know when the
scanner has reached the end of input. So, for me, using
the string length to indicate that the desired character
isn't found doesn't cost me anything over the -1 value.
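The trade-off between the two return conventions can be seen side by side in a small Python model. This is a sketch, not Parrot's implementation: both `find_not_digit_*` functions are hypothetical stand-ins for `find_not_cclass .CCLASS_NUMERIC`, differing only in what they return when every remaining character is a digit.

```python
def find_not_digit_minus1(s: str, pos: int) -> int:
    """First non-digit offset at or after pos, or -1 if there is none."""
    for i in range(pos, len(s)):
        if not s[i].isdigit():
            return i
    return -1


def find_not_digit_len(s: str, pos: int) -> int:
    """First non-digit offset at or after pos, or len(s) if there is none."""
    i = find_not_digit_minus1(s, pos)
    return len(s) if i == -1 else i


def digits_minus1(s: str, pos: int) -> str:
    """Extract a digit run using the -1 convention (needs a fix-up step)."""
    end = find_not_digit_minus1(s, pos)
    if end == -1:
        # Special case: the digits run to the end of the string.
        end = len(s)
    return s[pos:end]


def digits_len(s: str, pos: int) -> str:
    """Extract a digit run using the length convention (no special case)."""
    return s[pos:find_not_digit_len(s, pos)]


def only_digits_left(s: str, pos: int) -> bool:
    """The 'not found' test under the length convention."""
    return find_not_digit_len(s, pos) == len(s)
```

Both extraction functions return the same substrings; the length convention just moves the cost from every extraction to the rarer "was it found?" test, which needs the string length that a scanner typically keeps at hand anyway.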

What's best for the long term?

Jens Rieks

May 12, 2005, 9:13:31 AM
to perl6-i...@perl.org, Patrick R. Michaud
On Thursday 12 May 2005 04:58, Patrick R. Michaud wrote:
> I wrote a patch [#35410] to get find_cclass and find_not_cclass to work
> as described in my latest message (returning string length);
> do you want to review it at all or should I just apply it
> directly? (Didn't want to apply the patch without checking
> w/you first.)
Oops.
I've applied the patch, but I was in a hurry and forgot to commit it :-(
Sorry for the delay!

> Thanks!
>
> Pm
