Word boundaries

Zbigniew Łukasiak

Mar 26, 2012, 5:03:11 AM

to perl-u...@perl.org

For our spam classifier I need to split text into words.
Unfortunately the '\b' regex assertion does not yet work for languages
written without spaces (apparently this is covered by Level 3 of
Unicode regex support,
http://unicode.org/reports/tr18/#Tailored_Word_Boundaries), so I need
a custom solution. This did not seem very difficult: split the text
into blocks of the same Unicode script, then use '\b' for most scripts
and appropriate libraries for the rest (at least for Chinese there are
some tokenizers on CPAN). But:

1. How can I split the text into blocks of the same script? (Wouldn't a
script-boundary regex property be useful?) OK, I can always loop over
the characters, look up each one's script and compare it with the
previous one's, i.e. back to the C mode of programming. But then there
is still the question of:

2. How can I check which script a character belongs to? Do I need to
cut and paste all the script ranges from unicode.org into a huge
if-else branch in my program, or is there a simpler way?

Thanks in advance,
Zbigniew

Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯

Mar 26, 2012, 6:57:51 AM

to perl-u...@perl.org

> How can I check what script a character belongs to?

$ perl -Mutf8 -MUnicode::UCD=charinfo -E'say charinfo(ord "为")->{script}'
Han

Sanity checks:

$ perl -Mutf8 -E'say "为" =~ /\p{Han}/'
1

$ uniprops -a1 为 | ack Script
Script=Han
Script=Hani
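
As a rough sketch, the same lookup can be wrapped in a small helper for use inside a program. The name script_of is hypothetical, and it calls charscript from Unicode::UCD rather than charinfo, on the assumption that only the script name is needed. The fallback to 'Unknown' is a defensive choice for code points that have no script assigned.

```perl
use 5.010;
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper: Script property of the first code point of a string.
sub script_of {
    my ($char) = @_;
    return charscript( ord $char ) // 'Unknown';
}

# Print the code point and script name for a few sample characters.
printf "U+%04X %s\n", ord($_), script_of($_) for 'a', 'λ', '为', ' ', '.';
```

The space and the full stop both report Common, which is why they never extend a Latin or Han run.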

> check if it is the same as the
> previous one - i.e. back to C mode of programming.

Let the regex engine help you advance the character counter.

$ cat langs
ΕλληνικάEnglish한국어日本語Русскийไทย

----

$ cat langs.pl
use 5.010;
use strictures;
use Unicode::UCD qw(charinfo);

sub script {
    return charinfo(ord substr($_[0], 0, 1))->{script};
}

# Consume one grapheme cluster, then extend it with the rest of the run in
# the same script; \G and /c keep both matches anchored at pos($_).
while (/\G(\X)/gc) {
    my $part   = $1;
    my $script = script $part;
    $part .= $1 if /\G(\p{$script}+)/gc;
    say $part;
}

----

$ perl -C -ln langs.pl < langs
Ελληνικά
English
한국어
日本語
Русский
ไทย

Zbigniew Łukasiak

Mar 27, 2012, 8:21:50 AM

to Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯, perl-u...@perl.org

On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <da...@cpan.org> wrote:
> Let the regex engine help you advance the character counter.

Thanks a lot!

Here is the first version of my tokenizer based on this idea:

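use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;

sub tokenize {
    my $text = shift;
    my @tokens;
    # Take one grapheme cluster, then extend it with the rest of the run
    # in the same script.
    while ( $text =~ /(\X)/g ) {
        my $part = $1;
        my $script = charinfo( ord $part )->{script};
        # The * allows an empty match, so this always succeeds at pos().
        $text =~ /(\p{$script}*)/g;
        next if $script eq 'Common';    # whitespace, punctuation, digits
        $part .= $1;
        if ( $script eq 'Han' ) {
            push @tokens, mmseg( $part );
        }
        else {
            push @tokens, $part;
        }
    }
    return @tokens;
}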

And the surprise - this works even without further splitting, because
spaces and punctuation all get the 'Common' script and so are not
matched by \p{Latin}.
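
To illustrate the point about Common, here is a minimal sketch that leaves out the Chinese segmenter and only reports the script runs. The name script_runs and the sample string are made up for illustration. It uses charscript from Unicode::UCD and writes the property as \p{Script=...}, on the assumption that the regex should test the same Script property that the lookup reports.

```perl
use 5.010;
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Hypothetical helper, not part of the tokenizer above: returns a list of
# [script, run] pairs and drops runs whose script is Common.
sub script_runs {
    my ($text) = @_;
    my @runs;
    while ( $text =~ /\G(\X)/gc ) {
        my $run    = $1;
        my $script = charscript( ord $run ) // 'Unknown';
        # Extend the run while the following characters share the script.
        $run .= $1 if $text =~ /\G(\p{Script=$script}+)/gc;
        push @runs, [ $script, $run ] unless $script eq 'Common';
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
say "$_->[0]: $_->[1]" for script_runs('Hello, мир! 日本語 ok');
```

The comma, the exclamation mark and the spaces each start a Common run, which is skipped, so the words on either side come out as separate tokens without any explicit splitting on whitespace.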

--
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/