As it is now, I'm having to put in extra checks to watch for
the end of the string when I'm just wanting to run along a
string of digits. Of course, if we had an op to find the
offset of the first non-digit/non-word/non-whitespace/etc.
codepoint then I could use that. :-)
I'll be glad to write the patch for is_digit and friends
if it's appropriate. Otherwise I'll just continue to work
around the current behavior with the extra checks for end of string.
Pm
Sounds reasonable. We can change that along with the proposed
C<is_cclass> implementation.
> As it is now, I'm having to put in extra checks to watch for
> the end of the string when I'm just wanting to run along a
> string of digits. Of course, if we had an op to find the
> offset of the first non-digit/non-word/non-whitespace/etc.
> codepoint then I could use that. :-)
Well, then working on the general find and find_not classifying opcodes
and implementation seems to be more effective than ...
> I'll be glad to write the patch for is_digit and friends
> if it's appropriate.
... repairing doomed code :-)
> Pm
leo
C<is_cclass> is now implemented. The second part (find_(not_)cclass) will
follow tomorrow.
jens
PS: The "get_byte past the end of the buffer (1 of 1)" error is hopefully
fixed, too. Don't know if it is the cleanest way, but it at least works :-)
Wow, yippee -- that is fast! There were several times yesterday
when I really wished for find_not_cclass, so this will really
clean up (and speed up) the PGE implementation. Thanks!
Pm
The attached patch file adjusts C<is_cclass> to always return false
for offsets beyond the end of the string, and updates
t/op/string_cclass.t to test this.
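
As a rough illustration of why this helps, here is a small model of the behaviour in Python rather than PIR, since it is only meant to show the semantics. The names is_cclass_model and scan_digits are hypothetical, str.isdigit merely stands in for the numeric class, and treating negative offsets as out of range is an assumption; none of this is Parrot's actual implementation.

```python
def is_cclass_model(pred, s, offset):
    """Hypothetical model: True if s[offset] satisfies pred, False outside the string."""
    if not 0 <= offset < len(s):
        # Beyond the end (or, by assumption, before the start): never in any class.
        return False
    return pred(s[offset])


def scan_digits(s, pos):
    """Run along a string of digits with no separate end-of-string check."""
    start = pos
    # The loop stops on the first non-digit OR when pos runs off the end,
    # because the model returns False for out-of-range offsets.
    while is_cclass_model(str.isdigit, s, pos):
        pos += 1
    return s[start:pos], pos
```

With an out-of-range offset reporting false, the digit loop needs only one test per character, which is the extra check the original complaint was about.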
Pm
Wow, that's fast
> The attached patch file adjusts C<is_cclass> to always return false
> for offsets beyond the end of the string, and updates
> t/op/string_cclass.t to test this.
I'll leave the patch to Jens.
As a side note:
If ENCODING_GET_CODEPOINT (and its equivalent get_byte) returns 0x0 at
the very end of the string (and throws an exception beyond it) would
this catch the desired end-of-string semantics for character classes?
leo
find_cclass and find_not_cclass are in now.
jens
This is *excellent*.
However, now that I look at things, I'm wondering if a slight
change to the specification would be in order -- in my original
post I said that find_cclass and find_not_cclass would return -1
whenever they (didn't find | found) the character of interest --
perhaps it would be better for them to return the length of
the string instead?
This might make it easier to do things like:
    .local int pos
    .local string token
    $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
    $I0 -= pos                      # length of the run of word characters
    token = substr $S0, pos, $I0    # null string if no word chars at pos
whereas if we return -1 on "not found", we have to do some funny
checking for it (rather than getting a nice null string).
Note that we can still easily check for the existence of the char...
    .local int pos
    .local string token
    $I2 = length $S0
    $I0 = find_not_cclass .CCLASS_WORD, $S0, pos
    if $I0 == $I2 goto end_of_string_reached
If you're in agreement, I'll create/submit a patch and update
the test files. Best to decide this now before too many people
start using it. :-)
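
To make the trade-off concrete, here is a sketch of the two return conventions in Python (used only as a model; the function names, the is_word predicate, and the slicing helpers are all hypothetical and not part of Parrot):

```python
def find_not_cclass_minus1(pred, s, pos):
    """Model of the current convention: -1 if every char from pos on matches."""
    for i in range(pos, len(s)):
        if not pred(s[i]):
            return i
    return -1


def find_not_cclass_len(pred, s, pos):
    """Model of the proposed convention: len(s) if every char from pos on matches."""
    for i in range(pos, len(s)):
        if not pred(s[i]):
            return i
    return len(s)


def is_word(ch):
    """Assumed stand-in for the word class: alphanumerics plus underscore."""
    return ch.isalnum() or ch == "_"


def token_minus1(s, pos):
    """Extract the word at pos under the -1 convention (needs a special case)."""
    end = find_not_cclass_minus1(is_word, s, pos)
    if end == -1:          # "not found" has to be mapped back to the end
        end = len(s)
    return s[pos:end]


def token_len(s, pos):
    """Extract the word at pos under the length convention (no special case)."""
    end = find_not_cclass_len(is_word, s, pos)
    return s[pos:end]
```

Both helpers return the same token; the difference is only that the -1 version needs the extra branch, while the length version yields an empty token at end of string without any special handling.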
Pm
> IMO, we should deprecate the old find_* ops.
> It's a lot of (more or less) duplicate code, and not easy to maintain.
Yep, as well as the old is_foo opcodes and interfaces.
> jens
leo
jens
...well, in looking at it some more it's reasonable until I see
that returning -1 is the way the other find_* ops work. So,
part of me thinks we should either be consistent with those, or
make the others consistent with the interpretation I gave above, or
rename the find_cclass and find_not_cclass ops to something different
(perhaps "span_cclass" and "span_not_cclass") so as to avoid
confusion, or deprecate the pre-existing find_* ops.
Any suggestions from the peanut gallery about how this should
work? If I were designing for the long run I'd make the change,
but that's really not my decision to make, and I'll happily live
with any decision made as long as there's some sort of character
class support (and will likely generate+submit a patch for it).
I only bring this up because I'm asking myself what we'll want in
the long run, as opposed to what we might have now.
Some notes about the motivation for the change below...
Pm
-----
Why find_not_cclass is useful, and why it's useful to return the
length of string instead of an error (-1) indicator:
As many of you know, I've written the Perl Grammar Engine, and I'm
also working on another parser. In parsing expressions there are
many times where I want to be able to skip over a sequence of
characters until I find one that is not in a certain class --
i.e., for grabbing sequences of digits, word characters, skipping
whitespace, etc.
For example, before having find_cclass and find_not_cclass, grabbing
a sequence of digits in Parrot required a loop, as in:
    .local string target            # string to scan
    .local int pos                  # current scanning position
    $I0 = pos                       # remember the starting position
  digits_loop:
    $I1 = is_digit target, pos
    unless $I1 goto digits_end
    inc pos
    goto digits_loop
  digits_end:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits
Having find_not_cclass can make this much simpler:
    .local string target            # string to scan
    .local int pos                  # current scanning position
    $I0 = pos                       # remember the starting position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits
Similarly, I often want to skip comments (e.g., '#' until the next
newline):
    .local string target            # string to scan
    .local int pos                  # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
  comment_skipped:
    # ...
Unfortunately, if find_cclass and find_not_cclass return -1 when the
desired character isn't available, I have to check for this special
return value and handle it accordingly:
    .local string target            # string to scan
    .local int pos                  # current scanning position
    $I0 = pos                       # remember the starting position
    pos = find_not_cclass .CCLASS_NUMERIC, target, pos
    unless pos == -1 goto digit_1
    pos = length target             # not found: treat as end of string
  digit_1:
    $I1 = pos - $I0
    $S0 = substr target, $I0, $I1   # extract sequence of digits
and
    .local string target            # string to scan
    .local int pos                  # current scanning position
    $S0 = substr target, pos, 1
    unless $S0 == "#" goto comment_skipped
    inc pos
    pos = find_cclass .CCLASS_NEWLINE, target, pos
    unless pos == -1 goto comment_skipped
    pos = length target             # no newline: comment runs to end of string
  comment_skipped:
    # ...
So, in these particular instances, receiving the length of the
string is much more useful than the -1 value. Of course, the -1
value is a quick indicator that the desired character class
wasn't found; without it we have to check the return value
against the string length if we want to know if the character
position wasn't located. For example, instead of:
    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == -1 goto only_digits_left
with find_not_cclass returning the length of the string, it becomes
    $I2 = length $S0
    $I1 = find_not_cclass .CCLASS_NUMERIC, $S0, $I0
    if $I1 == $I2 goto only_digits_left
In the things I'm writing thus far, I'm having to keep the
length of the string in a register anyway to know when the
scanner has reached the end of input. So, for me, using
the string length to indicate that the desired character
isn't found doesn't cost me anything over the -1 value.
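
A minimal sketch of that style of scanner, again in Python as a model only: find_cclass and find_not_cclass here are hypothetical stand-ins that return the string length on "not found", and the numbers generator is an invented example, not code from PGE.

```python
def find_cclass(pred, s, pos):
    """Hypothetical model: first index >= pos where pred holds, else len(s)."""
    n = len(s)
    while pos < n and not pred(s[pos]):
        pos += 1
    return min(pos, n)      # clamp in case the caller started past the end


def find_not_cclass(pred, s, pos):
    """Hypothetical model: first index >= pos where pred fails, else len(s)."""
    return find_cclass(lambda ch: not pred(ch), s, pos)


def numbers(s):
    """Yield each run of digits, skipping '#' comments and everything else."""
    n = len(s)              # kept around anyway to detect end of input
    pos = 0
    while pos < n:
        if s[pos] == "#":
            # Skip to the newline; an unterminated comment simply ends the scan.
            pos = find_cclass(lambda ch: ch == "\n", s, pos + 1)
        elif s[pos].isdigit():
            end = find_not_cclass(str.isdigit, s, pos)
            yield s[pos:end]
            pos = end
        else:
            pos += 1        # whitespace or any other character: skip it
```

The only end-of-input test is the `pos < n` loop condition; neither the comment branch nor the digit branch needs to check for a special "not found" value.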
What's best for the long term?