# $chunk stores the text body
$sentenses = $chunk.split(/。|？|！/)
# now $sentenses holds the list of sentences.
But when I checked the result, I found that some of the sentences weren't
split correctly. For instance, here is a sentence:
"你没病，他呢？" (meaning "You are not sick, how about him?"). In
GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
encoded as (hex) a1a3. So the String#split method finds a
"。" in the middle of the sentence and incorrectly splits there.
Certainly this is because String#split (and the Ruby regex
engine) is byte-oriented rather than truly character-oriented, which is a
frequent problem in the i18n domain. Is there any way in Ruby to split
Chinese text correctly?
Thanks in advance.
myan
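The collision described above can be reproduced from the raw bytes alone. Below is a minimal sketch for Ruby 1.9 or later (newer than the Ruby used in this thread). The constant and method names are made up for illustration, and the byte values are the ones quoted in the message.

```ruby
# Byte values quoted in the message, held as binary (byte-oriented) strings.
BING_COMMA = "\xB2\xA1\xA3\xAC".b  # "病，" in GB2312: b2a1 a3ac
FULL_STOP  = "\xA1\xA3".b          # "。" in GB2312: a1a3

# Hypothetical helper: split with no knowledge of character boundaries,
# which is what a byte-oriented String#split effectively does.
def byte_split(text, separator)
  text.b.split(separator.b)
end

# The full stop's two bytes appear at byte offset 1, made up of the second
# byte of "病" and the first byte of "，".
p BING_COMMA.index(FULL_STOP)               # => 1

# So a byte-wise split cuts both characters in half.
p byte_split(BING_COMMA, FULL_STOP).map { |s| s.unpack("H*").first }
# => ["b2", "ac"]
```

Neither fragment is a valid GB2312 character on its own, which matches the broken sentences reported above.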
>From: "Mike Meng" <meng...@gmail.com>
>Reply-To: ruby...@ruby-lang.org
>To: ruby...@ruby-lang.org (ruby-talk ML)
>Subject: Encounter troubles with Regex in Chinese text splitting
>Date: Sat, 3 Dec 2005 14:42:31 +0900
>
>Hi All,
> I'm a Ruby newbie. I'm writing a program to process a big chunk of
>Chinese text. The first step is to split the chunk of text into a list
>of sentences. In Chinese, all the characters are written one after another
>without any natural boundary marker like the space in English. Sentences are
>separated by one of three special characters (。？！). So at
>first glance, I thought it was a simple task:
>
># $chunk stores the text body
>$sentenses = $chunk.split(/。|？|！/)
># now $sentenses holds the list of sentences.
>
> But when I checked the result, I found that some of the sentences weren't
>split correctly. For instance, here is a sentence:
>"你没病，他呢？" (meaning "You are not sick, how about him?"). In
>GB2312, "病，" is encoded as (hex) b2a1 a3ac, and "。" happens to be
>encoded as (hex) a1a3. So the String#split method finds a
>"。" in the middle of the sentence and incorrectly splits there.
>
> Certainly this is because String#split (and the Ruby regex
>engine) is byte-oriented rather than truly character-oriented, which is a
>frequent problem in the i18n domain. Is there any way in Ruby to split
>Chinese text correctly?
>
> Thanks in advance.
>
> myan
>
>
Try the script with $KCODE = "E"
Hope this helps,
Park Heesob
Could you please tell me the reason, and where I can find the relevant
documentation?
Thank you.
myan
$KCODE selects the multi-byte character coding system Ruby handles. If the
first character of $KCODE is 'e' or 'E', Ruby handles EUC. If it is 's' or
'S', Ruby handles Shift_JIS. If it is 'u' or 'U', Ruby handles UTF-8. If it
is 'n' or 'N', Ruby doesn't handle multi-byte characters. The default value
is "NONE".
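
To make "handles EUC" concrete: in EUC-style encodings, including the EUC-CN form in which GB2312 text is normally stored, both bytes of a two-byte character have the high bit set. A character-aware scanner can therefore consume such bytes in pairs and never start a match in the middle of a pair, which is presumably why the "E" setting helps with GB2312. The sketch below imitates that idea by hand for Ruby 1.9 or later. It is not how Ruby's regex engine is implemented, the method and constant names are hypothetical, and it assumes well-formed two-byte input (the three-byte sequences of EUC-JP are ignored).

```ruby
# Hypothetical helper: cut raw EUC-style bytes into characters. A byte with
# the high bit set is taken together with the byte that follows it; any
# other byte stands alone. Assumes well-formed two-byte sequences.
def euc_chars(bytes)
  data = bytes.b
  chars = []
  i = 0
  while i < data.bytesize
    width = data.getbyte(i) >= 0x80 ? 2 : 1
    chars << data.byteslice(i, width)
    i += width
  end
  chars
end

# Hypothetical helper: split at terminator characters, comparing whole
# characters rather than searching the byte stream.
def euc_split(bytes, terminators)
  stops = terminators.map(&:b)
  sentences = [String.new]
  euc_chars(bytes).each do |ch|
    if stops.include?(ch)
      sentences << String.new
    else
      sentences.last << ch
    end
  end
  sentences.reject(&:empty?)
end

# "病，。病" in GB2312, built from the byte values quoted in the thread.
EUC_SAMPLE = "\xB2\xA1\xA3\xAC\xA1\xA3\xB2\xA1".b

p euc_split(EUC_SAMPLE, ["\xA1\xA3"]).map { |s| s.unpack("H*").first }
# => ["b2a1a3ac", "b2a1"]
```

Only the real "。" at byte offset 4 acts as a separator. The a1 a3 pair at offset 1 is skipped because it straddles two characters.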
Regards,
Park Heesob
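
Note that $KCODE applies to Ruby 1.8. Ruby 1.9 and later ignore it, and each string carries its own encoding instead. A rough modern equivalent, offered as a sketch rather than as anything proposed in this thread, is to label the bytes as GB2312, transcode them to UTF-8 with Ruby's built-in converter, and split there. The method name split_sentences is hypothetical, and the sample text extends the sentence from the original question.

```ruby
# Hypothetical helper: interpret raw bytes as GB2312, convert to UTF-8,
# then split on the three full-width sentence terminators.
def split_sentences(gb_bytes)
  text = gb_bytes.dup.force_encoding("GB2312").encode("UTF-8")
  text.split(/[。？！]/)
end

# Simulate GB2312 input by encoding a UTF-8 literal.
raw = "你没病，他呢？我很好。".encode("GB2312")

p split_sentences(raw)
# => ["你没病，他呢", "我很好"]
```

Because the split runs on characters rather than bytes, the a1 a3 pair inside "病，" is no longer mistaken for "。".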