
unicode classes vs c/posix ctype classes


Rich Felker

Feb 7, 2006, 2:01:00 AM
I'm trying to decide on the correct way to assign ctype classes to
UCS, and not sure if there's any consensus on the correct way. My idea
so far is:

Lu -> upper
Ll -> lower
Lt -> alpha
Lm -> alpha
Lo -> alpha

Mn -> alpha
Mc -> alpha
Me -> (none) ???

Nd -> digit
Nl -> (none)
No -> (none)

Zs -> space
Zl -> space
Zp -> space

(only space and tab) -> blank

Cc -> cntrl
Cf -> cntrl (???)
Cs -> (n/a)
Co -> (none)
Cn -> (none)

P?,S? -> punct
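
As a rough illustration only, the tentative table above could be written as a lookup on the two-letter general category. The function name `class_from_category` and the `CT_*` bit names are made up for this sketch and do not come from any libc. It follows the table as posted, including the entries still marked with question marks, and sets the alpha bit for Lu/Ll because ISO C makes isalpha true whenever isupper or islower is. Class blank is left out, since the proposal defines it by explicit code points rather than by category.

```c
#include <string.h>

/* Illustrative class bits; names and values are assumptions of this sketch. */
enum {
    CT_UPPER = 1 << 0,
    CT_LOWER = 1 << 1,
    CT_ALPHA = 1 << 2,
    CT_DIGIT = 1 << 3,
    CT_SPACE = 1 << 4,
    CT_CNTRL = 1 << 5,
    CT_PUNCT = 1 << 6
};

/* Map a general category such as "Lu" to class bits, per the draft table.
   Unknown or malformed categories map to 0 (no class). */
unsigned class_from_category(const char *gc)
{
    if (gc == NULL || strlen(gc) != 2)
        return 0;
    switch (gc[0]) {
    case 'L':
        if (gc[1] == 'u') return CT_UPPER | CT_ALPHA; /* upper implies alpha */
        if (gc[1] == 'l') return CT_LOWER | CT_ALPHA; /* lower implies alpha */
        return strchr("tmo", gc[1]) ? CT_ALPHA : 0;   /* Lt, Lm, Lo */
    case 'M':
        return strchr("nc", gc[1]) ? CT_ALPHA : 0;    /* Me: none (open question) */
    case 'N':
        return gc[1] == 'd' ? CT_DIGIT : 0;           /* Nl, No: none */
    case 'Z':
        return strchr("slp", gc[1]) ? CT_SPACE : 0;   /* Zs, Zl, Zp */
    case 'C':
        return strchr("cf", gc[1]) ? CT_CNTRL : 0;    /* Cf is tentative */
    case 'P':
    case 'S':
        return CT_PUNCT;
    default:
        return 0;
    }
}
```

For example, `class_from_category("Mn")` yields only the alpha bit, while `class_from_category("Me")` yields no class at all.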

The big questions are:

1. Should all Mn/Mc (nonspacing/spacing combining mark) characters be in
class alpha?

Most certainly _some_ of them need to be, since otherwise [:alpha:]+
won't match even a whole word in most South Asian scripts, and of
course these scripts won't be allowed in contexts where only
alphanumeric characters are valid. One problem with no easy solution
is that the initial character of an alphanumeric data item should be
restricted to noncombining characters for most applications, but the
ctype system has no means to enforce this without introducing new
types (although wcwidth could be used).

2. Should digit characters outside of ascii 0-9 be classified as
digits?

My feeling is that in principle they should, but it may cause lots of
problems... If they are classified as digits, does this imply that
strtol, etc. must accept them?

I'm aware that glibc and uClibc both exclude non-Latin 0-9 digits from
the digit class, but this doesn't mean it's the correct behavior.
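
One way to sidestep the strtol question is for an application that wants non-ASCII digits to convert them itself, leaving class digit alone. The following is only a sketch under that assumption: `decimal_value` and `parse_decimal` are hypothetical helpers, and the table lists just a handful of decimal-digit runs by hand, where a real one would be generated from UnicodeData.txt. Overflow is not checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Code point of digit zero for a few Nd runs (illustrative, not complete). */
static const uint32_t nd_zero[] = {
    0x0030, /* ASCII */
    0x0660, /* Arabic-Indic */
    0x06F0, /* Extended Arabic-Indic */
    0x0966, /* Devanagari */
    0xFF10  /* Fullwidth */
};

/* Return 0..9 for a listed decimal digit, or -1 for anything else. */
int decimal_value(uint32_t cp)
{
    for (size_t i = 0; i < sizeof nd_zero / sizeof nd_zero[0]; i++)
        if (cp >= nd_zero[i] && cp <= nd_zero[i] + 9)
            return (int)(cp - nd_zero[i]);
    return -1;
}

/* Parse the leading run of digits in s[0..n); store the value in *out and
   return how many code points were consumed. No overflow checking. */
size_t parse_decimal(const uint32_t *s, size_t n, unsigned long *out)
{
    unsigned long v = 0;
    size_t i = 0;
    int d;

    while (i < n && (d = decimal_value(s[i])) >= 0) {
        v = v * 10 + (unsigned long)d;
        i++;
    }
    *out = v;
    return i;
}
```

With this split, isdigit and strtol keep their ASCII-only behavior, and only code that opts in to `parse_decimal` sees other scripts' digits.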

3. Should characters other than ASCII space and tab be included in
blank?

My feeling is no, since the 'blank' ctype is intended for parsing
fields in text-format data/config files. The 'space' class is more
appropriate if you want to use it for word breaking, etc. (This raises
another question: should non-breaking space be considered a space
character? What about zero-width space, word joiner, etc.?)

4. Are Me (enclosing mark) characters ever used for actual
alphabetic purposes (spelling words/names), or just nonsense like
drawing circles around letters?

If some are needed for alphabetic purposes, I suppose at least those
must be included in class alpha.

Anyone on this list have strong opinions on these issues, or know of a
place where I can find archived discussion, precedents, normative
documents on the matter, etc.?

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/


Rich Felker

Feb 7, 2006, 1:57:25 PM
On Tue, Feb 07, 2006 at 01:30:27PM +0100, Bruno Haible wrote:
> Rich Felker wrote:
> > I'm trying to decide on the correct way to assign ctype classes to
> > UCS, and not sure if there's any consensus on the correct way.
>
> There's certainly some amount of judgement involved. The way it's done
> in glibc is found in glibc/localedata/gen-unicode-ctype.c.

OK, thanks, I'll have a look. BTW, is there a reason you used C for
this rather than a several-line sed script to apply the 'corrections'
to UnicodeData.txt?

BTW, glibc seems to be highly incorrect on isalpha. Basically any word
in most South Asian languages is nonalphabetic according to the rules
as I read them, due to excluding combining letters. The only correct
one is Thai where you've included exceptions.

> > Lu -> upper
> > Ll -> lower
>
> I think this needs to take into account the towupper and towlower mappings.

I don't see how this is so. Classifying a character as upper/lower is
much more general than case mappings, since some relationships cannot
be represented with case mappings. I don't see anywhere that ISO C or
POSIX requires toupper to change a character in order for that
character to be considered lowercase, or vice versa.

> > Lt -> alpha
> > Lm -> alpha
> > Lo -> alpha
>
> There are a couple of special cases to be considered here.

...like? Just the errors in Thai?

> > Nd -> digit
>
> If you do that, the resulting locale is not ISO C 99 compliant.

Thanks for this info. I guess that answers the question. :)

> > Zs -> space
> > Zl -> space
> > Zp -> space
>
> U+00A0 shouldn't be treated like a space.

Yes, I asked about that below. Thanks for the answer. Are there other
'space' characters that should not be treated as a space?

> > Cf -> cntrl (???)
>
> I wouldn't do so. Many programs use iscntrl() as a test whether to drop
> a character from the output. Cf class characters shouldn't be dropped.

Good point. Then should they be printable but non-graphic? Or totally
unclassified?
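
To make the concern about dropped characters concrete, here is a minimal sketch of an output filter that discards only Cc characters, so that Cf characters such as U+200D ZERO WIDTH JOINER pass through. The names `is_cc` and `drop_controls` are invented for this example; the Cc test is hard-coded because Cc consists of exactly U+0000..U+001F and U+007F..U+009F.

```c
#include <stddef.h>
#include <stdint.h>

/* True for general category Cc: the C0 controls, DEL, and the C1 controls. */
int is_cc(uint32_t cp)
{
    return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

/* Copy src[0..n) to dst, skipping Cc characters. dst must have room for n
   code points. Returns the number of code points written. */
size_t drop_controls(const uint32_t *src, size_t n, uint32_t *dst)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_cc(src[i]))
            dst[out++] = src[i];
    return out;
}
```

A program that used iscntrl() for the same purpose would behave like this only if Cf is kept out of class cntrl.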

> > 3. Should characters other than ASCII space and tab be included in
> > blank?
>
> This is a muddy area.

:)

> > 4. Are Me (enclosing mark) characters ever used for actual
> > alphabetic purposes (spelling words/names), or just nonsense like
> > drawing circles around letters?
>
> There is also:
> 0488;COMBINING CYRILLIC HUNDRED THOUSANDS SIGN;Me;0;NSM;;;;;N;;;;;
> 0489;COMBINING CYRILLIC MILLIONS SIGN;Me;0;NSM;;;;;N;;;;;

These are non-alphabetic, right?

> 06DE;ARABIC START OF RUB EL HIZB;Me;0;NSM;;;;;N;;;;;

This seems to be just an annotation mark, but it's grouped among other
annotation marks of other combining classes so I suppose it would be
bad to treat them differently.
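
The records quoted above are semicolon-separated UnicodeData.txt lines, with the code point in the first field and the general category in the third. As a sketch of how little is needed to pull those two fields out, assuming well-formed input and a hypothetical helper name `parse_unicode_data`:

```c
#include <stdlib.h>
#include <string.h>

/* Extract the code point (field 0, hex) and general category (field 2)
   from one UnicodeData.txt record. gc receives a NUL-terminated two-letter
   string. Returns 1 on success, 0 if the line does not look like a record. */
int parse_unicode_data(const char *line, unsigned long *cp, char gc[3])
{
    char *end;
    const char *name_end, *cat;
    unsigned long v = strtoul(line, &end, 16);

    if (end == line || *end != ';')
        return 0;                      /* no hex code point field */
    name_end = strchr(end + 1, ';');   /* skip the character name */
    if (name_end == NULL)
        return 0;
    cat = name_end + 1;
    if (strlen(cat) < 3 || cat[2] != ';')
        return 0;                      /* category must be two letters */
    gc[0] = cat[0];
    gc[1] = cat[1];
    gc[2] = '\0';
    *cp = v;
    return 1;
}
```

Fed the U+0488 record quoted above, this yields code point 0x0488 and category "Me".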

Rich Felker

Feb 7, 2006, 4:53:04 PM
To answer some of my own questions and elaborate...

On Tue, Feb 07, 2006 at 02:01:00AM -0500, Rich Felker wrote:
> 1. Should all Mn/Mc (nonspacing/spacing combining mark) characters be in
> class alpha?
>
> Most certainly _some_ of them need to be, since otherwise [:alpha:]+
> won't match even a whole word in most South Asian scripts, and of
> course these scripts won't be allowed in contexts where only
> alphanumeric characters are valid. One problem with no easy solution
> is that the initial character of an alphanumeric data item should be
> restricted to noncombining characters for most applications, but the
> ctype system has no means to enforce this without introducing new
> types (although wcwidth could be used).

A few ideas.

Solution 1, a horrible hack, would be to include all combining
characters in the class alnum but not alpha, which gives the correct
semantics for alphanumeric identifier fields
([[:alpha:]][[:alnum:]]*). Thankfully it is forbidden: ISO C defines
isalnum as true exactly when isalpha or isdigit is true.

Solution 2 would be to exclude combining letters from alpha and have a
separate class mark. Alphabetic names/identifiers would then be
[[:alpha:]][[:alpha:][:mark:]]*. Unfortunately this would require all
applications to support a nonstandard ctype for the purpose of
matching valid names.
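
A minimal sketch of what Solution 2's pattern, [[:alpha:]][[:alpha:][:mark:]]*, would look like in code. Everything here is hypothetical: `is_base_alpha`, `is_mark`, and `is_name` are invented names, and the two predicates cover only ASCII letters, Devanagari letters, the combining diacritics block, and a few Devanagari signs, hard-coded by hand so the example is self-contained. A real implementation would consult full tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Noncombining letters: ASCII plus Devanagari U+0904..U+0939 (toy subset). */
int is_base_alpha(uint32_t cp)
{
    return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z') ||
           (cp >= 0x0904 && cp <= 0x0939);
}

/* Combining marks: U+0300..U+036F, Devanagari U+0901..U+0903 and
   U+093E..U+094D (vowel signs and virama). Again a toy subset. */
int is_mark(uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) ||
           (cp >= 0x0901 && cp <= 0x0903) ||
           (cp >= 0x093E && cp <= 0x094D);
}

/* Match [[:alpha:]][[:alpha:][:mark:]]* over the whole of s[0..n). */
int is_name(const uint32_t *s, size_t n)
{
    if (n == 0 || !is_base_alpha(s[0]))
        return 0;                 /* must start with a noncombining letter */
    for (size_t i = 1; i < n; i++)
        if (!is_base_alpha(s[i]) && !is_mark(s[i]))
            return 0;
    return 1;
}
```

Under these toy tables the Hindi word U+0939 U+093F U+0902 U+0926 U+0940 is accepted as a name, while the same sequence with its first letter removed is rejected because it begins with a vowel sign.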

Solution 3 is to ignore the fact that an initial combining mark is
somehow bad, and include combining marks directly in class alpha.

There are also variations 2a and 3a, which include _only_ the
alphabetic marks in class alpha. The problem with these is that it's
very difficult to decide which marks are alphabetic (aside from South
Asian scripts with combining letters), because accent marks, etc. can
be used in both alphabetic and nonalphabetic ways. Moreover, class
'alpha' apparently already needs to include a great many nonalphabetic
characters anyway, since non-Latin digits are excluded from class
'digit'.

Thus I think it's clear that in either solution 2 or 3, at least the
majority of the combining characters should be included. A question
remains whether there are /purely/ punctuational combining marks that
should be classified as punctuation rather than alphanumeric.
