
Character classification functions


Peter Gibbs

Nov 1, 2003, 10:37:10 AM
to perl6-internals
The current chartype struct contains an is_digit function. Do we want to add
is_alpha, is_space, etc., or will a single is_ctype function, with an enum
parameter, suffice?

A single function would simplify the addition of new character classes, but
at a (small?) cost in speed. It would also keep the chartype struct smaller,
but there are unlikely to be enough of those to make any significant
difference.

Since the current prototype includes the chartype, existing functions
(e.g. ICU u_is<xxx>) could not be called without a wrapper function anyway,
so a single function would mean one wrapper with a switch statement,
versus individual wrappers for each class.
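
As a rough sketch of the single-wrapper option, the following shows one function with a switch. The enum, its members, and the function signature are hypothetical, and the standard <ctype.h> routines stand in for the ICU u_is<xxx> calls so the example stays self-contained; the real prototype also takes the chartype, which is omitted here.

```c
#include <ctype.h>
#include <limits.h>

/* Hypothetical character classes; adding a class only needs a new
 * enum member and a new case in the switch below. */
typedef enum {
    CHARCLASS_DIGIT,
    CHARCLASS_ALPHA,
    CHARCLASS_SPACE
} charclass_t;

/* Single wrapper: one switch instead of one wrapper per class.
 * Returns 1 if c belongs to the class, 0 otherwise. */
int
is_ctype(unsigned int c, charclass_t cls)
{
    /* The <ctype.h> functions are only defined for values that fit
     * in an unsigned char (and EOF), so reject anything larger. */
    if (c > UCHAR_MAX)
        return 0;

    switch (cls) {
    case CHARCLASS_DIGIT: return isdigit((int)c) != 0;
    case CHARCLASS_ALPHA: return isalpha((int)c) != 0;
    case CHARCLASS_SPACE: return isspace((int)c) != 0;
    }
    return 0; /* unknown class */
}
```

The per-class alternative would replace the switch with one such wrapper per class, each needing its own slot in the struct.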

I prefer the single function approach, so that is what I will start
implementing if there are no timeous objections.

Regards
Peter Gibbs
EmKel Systems

Michael Scott

Nov 1, 2003, 11:59:05 AM
to perl6-i...@perl.org
On 1 Nov 2003, at 16:37, Peter Gibbs wrote:

> The current chartype struct contains an is_digit function. Do we want to add
> is_alpha, is_space, etc., or will a single is_ctype function, with an enum
> parameter, suffice?

Excuse me for being a naming fusspot for a second.

What Parrot calls a chartype is more commonly called a character set. I
mention this because it's the kind of thing you really notice when
writing documentation

http://www.vendian.org/parrot/wiki/bin/view.cgi/Main/ParrotDiagramsString

and therefore puts me in this frame of mind.

Since the enum will specify what you yourself call character classes
can't we call the function is_charclass() instead?

BTW the related get_digit() function currently fails some test that I'm
working on. If you pass it a non-digit character it blithely calculates
from first_code and first_value. Rather, it should indicate failure in
some way.
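
To make the point concrete, here is a minimal sketch of a range-checked lookup. Only the names first_code and first_value come from the existing code; the struct layout, the last_code field, the function name, and the use of -1 as the failure value are all assumptions for illustration.

```c
/* Hypothetical description of one contiguous run of digits. */
typedef struct {
    unsigned int first_code;  /* code point of the first digit in the run */
    unsigned int last_code;   /* code point of the last digit (assumed field) */
    int          first_value; /* numeric value of first_code */
} digit_map_t;

/* Returns the numeric value of c, or -1 if c lies outside the run.
 * Without the range check, a non-digit would silently produce a
 * meaningless value computed from first_code and first_value. */
int
get_digit_checked(const digit_map_t *map, unsigned int c)
{
    if (c < map->first_code || c > map->last_code)
        return -1;
    return map->first_value + (int)(c - map->first_code);
}
```

Whether failure is reported through a return value like this or through an exception is a separate design question.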

Mike

Leopold Toetsch

Nov 1, 2003, 11:41:44 AM
to Peter Gibbs, perl6-i...@perl.org
Peter Gibbs <pe...@emkel.co.za> wrote:

> I prefer the single function approach, so that is what I will start
> implementing if there are no timeous objections.

Yep, a single is_ctype() should really be enough.

> Regards
> Peter Gibbs

leo

Peter Gibbs

Nov 1, 2003, 3:18:20 PM
to perl6-i...@perl.org, Michael Scott
"Michael Scott" <michae...@mac.com> wrote:

> Since the enum will specify what you yourself call character classes
> can't we call the function is_charclass() instead?

The isascii etc. macros have been defined in a header called ctype.h for
some time, and glibc actually has a macro 'isctype' which does the exact
equivalent of what I am proposing, which is why I chose the name; however,
I have no personal preference, so I'll go with whatever seems most
popular. Your suggestion wins so far.

> BTW the related get_digit() function currently fails some test that I'm
> working on. If you pass it a non-digit character it blithely calculates
> from first_code and first_value. Rather, it should indicate failure in
> some way.

That depends on the definition, which I can't find anywhere. Throwing
an exception seems reasonable, so I'll do that.

> Mike
Peter

Dan Sugalski

Nov 2, 2003, 1:20:57 PM
to Peter Gibbs, perl6-internals
At 5:37 PM +0200 11/1/03, Peter Gibbs wrote:
>The current chartype struct contains an is_digit function. Do we want to add
>is_alpha, is_space, etc., or will a single is_ctype function, with an enum
>parameter, suffice?

It'd suffice, but I'd rather not do that for the moment. When these
get called a lot the small speed hits will pile up, and the regex
engine will end up doing that often. (And if we declare that the
chartype functions are immutable, we can play games with the JIT
more easily at some point in the future.) I think we'd be better served
with a small set of functions to detect common things and a fallback
function for the rest. We can wrap them all in macros so the
internals can do a PARROT_IS_SPACE(foo) regardless of whether
is_space is a vtable slot or uses the parameter function.
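
A minimal sketch of that macro layer, under stated assumptions: the struct layout, the enum, the two-argument macro form, and the ascii_* helpers are all hypothetical, and only the PARROT_IS_SPACE name comes from the text above. Here is_space has its own slot while is_alpha hangs off the fallback function, yet callers see the same macro style for both.

```c
/* Hypothetical character classes handled by the fallback function. */
typedef enum { CHARCLASS_SPACE, CHARCLASS_ALPHA } charclass_t;

/* Hypothetical chartype vtable: one dedicated slot plus a fallback. */
typedef struct chartype {
    int (*is_space)(const struct chartype *type, unsigned int c);
    int (*is_charclass)(const struct chartype *type, unsigned int c,
                        charclass_t cls);
} chartype_t;

/* Common class: dispatched straight through its own slot. */
#define PARROT_IS_SPACE(type, c) ((type)->is_space((type), (c)))
/* Less common class: routed through the parameterised fallback. */
#define PARROT_IS_ALPHA(type, c) \
    ((type)->is_charclass((type), (c), CHARCLASS_ALPHA))

/* Example chartype; the ranges below assume ASCII code values. */
static int
ascii_is_space(const chartype_t *type, unsigned int c)
{
    (void)type;
    return c == ' ' || (c >= '\t' && c <= '\r');
}

static int
ascii_is_charclass(const chartype_t *type, unsigned int c, charclass_t cls)
{
    switch (cls) {
    case CHARCLASS_SPACE:
        return ascii_is_space(type, c);
    case CHARCLASS_ALPHA:
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
    return 0; /* unknown class */
}

const chartype_t ascii_chartype = { ascii_is_space, ascii_is_charclass };
```

Moving a class between the two mechanisms then only means changing its macro definition and the struct, not the call sites.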

If we're feeling fancy, we can do something similar to the PMC
functions, where each required chartype function is a regular
function and we can fill in the structure appropriately depending on
which functions get their own vtable slot and which hang off the
fallback function.
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
