Unicode::Collate string replacements and case sensitivity

Frank Müller

Apr 28, 2011, 1:06:58 PM
to perl-u...@perl.org
Dear all,
I'm trying to do some string replacements with Unicode::Collate, which
usually works very well, but these replacements seem to be case
insensitive by default. How can I change this? Look at this simple
example:

use Unicode::Collate;
# normalization => undef is required for gsubst
my $myCollator = Unicode::Collate->new( normalization => undef, level => 1 );
my $str = "Camel camel donkey zebra came\x{301}l CAMEL horse cAmEL...";
$myCollator->gsubst($str, "camel", sub { "#$_[0]#" });

which makes the following replacements:

#Camel# #camel# donkey zebra #camél# #CAMEL# horse #cAmEL#...

what I would love to see is the following result:

Camel #camel# donkey zebra #camél# CAMEL horse cAmEL...

As there doesn't seem to be gsubst for case sensitive and gisubst for
case insensitive string replacements, what would a solution look like?

Thanks a lot for any suggestions,
Frank

SADAHIRO Tomoyuki

May 5, 2011, 9:06:44 AM
to Frank Müller, perl-u...@perl.org

Since (level => 1) is not the default, you can also specify
(level => 3) for case-sensitive matching. But UCA treats an accent
difference (level 2) as more significant than a case difference
(level 3), so camél won't match camel when (level => 3).

level 1: camel matches camél and Camel.
level 2: camel matches Camel but not camél.
level 3: camel matches neither Camel nor camél.
--Even at level 3, it isn't so strict:
camel matches "c-a-m-e-l", "ca mel", etc.
since punctuation difference is level 4.
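
The level behaviour listed above can be checked with a short script. This is only a sketch: matches_at_level is a hypothetical helper written for this illustration, not part of Unicode::Collate, and it assumes the collator's default handling of punctuation.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Hypothetical helper (not part of the module): does "camel" compare
# equal to $candidate when the collator stops at $level?
sub matches_at_level {
    my ($level, $candidate) = @_;
    my $collator = Unicode::Collate->new(
        normalization => undef,
        level         => $level,
    );
    return $collator->eq("camel", $candidate) ? 1 : 0;
}

my %candidates = (
    'Camel'     => "Camel",
    'accented'  => "came\x{301}l",   # e followed by U+0301 COMBINING ACUTE ACCENT
    'c-a-m-e-l' => "c-a-m-e-l",
);

for my $level (1 .. 3) {
    for my $label (sort keys %candidates) {
        printf "level %d: camel vs %-9s => %s\n", $level, $label,
            matches_at_level($level, $candidates{$label}) ? "match" : "no match";
    }
}
```

The labels are printed instead of the strings themselves so that the output stays plain ASCII.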

To make camel match camél but not Camel, another workaround is
needed. In the next release, a new parameter (ignore_level2) will allow
it. (However, the behavior of ignore_level2 is quite different from the
so-called caseLevel in UCA etc.)
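
For readers on a release that already provides ignore_level2, the original example might be adapted roughly as follows. This is a sketch that assumes the parameter is combined with (level => 3). The exact behaviour should be checked against the documentation of the installed version.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Assumption: the installed Unicode::Collate implements ignore_level2.
my $collator = Unicode::Collate->new(
    normalization => undef,   # required for gsubst
    level         => 3,       # keep case differences significant
    ignore_level2 => 1,       # but disregard accent differences
);

my $str = "Camel camel donkey zebra came\x{301}l CAMEL horse cAmEL...";
$collator->gsubst($str, "camel", sub { "#$_[0]#" });

binmode STDOUT, ':encoding(UTF-8)';   # the string contains U+0301
print "$str\n";
```

If the parameter behaves as described in the reply, only the lowercase camel and the accented camél should end up wrapped in # marks.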

Regards,
SADAHIRO Tomoyuki
