Why doesn't Go's unicode Katakana table include 0x30fc?

zhuoy...@gmail.com

Dec 10, 2019, 3:22:38 AM

to golang-nuts

Hi,

I have a question about the unicode Katakana table. Thanks in advance for any help.

https://en.wikipedia.org/wiki/Katakana_(Unicode_block)

Why does Go split the Katakana range into multiple ranges and leave out 0x30fb and 0x30fc? I think they should be included.

In Go's src/unicode/tables.go:

var _Katakana = &RangeTable{
	R16: []Range16{
		{0x30a1, 0x30fa, 1},
		{0x30fd, 0x30ff, 1},
		{0x31f0, 0x31ff, 1},
		{0x32d0, 0x32fe, 1},
		{0x3300, 0x3357, 1},
		{0xff66, 0xff6f, 1},
		{0xff71, 0xff9d, 1},
	},
	R32: []Range32{
		{0x1b000, 0x1b000, 1},
	},
}
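
For anyone who wants to reproduce the gap, here is a minimal sketch that walks U+30FA through U+30FF and asks the exported unicode.Katakana table about each rune. The helper name inKatakanaScript is made up for this illustration; only unicode.Is and unicode.Katakana come from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// inKatakanaScript is a hypothetical helper (not part of the unicode
// package). It reports whether r is listed in unicode.Katakana.
func inKatakanaScript(r rune) bool {
	return unicode.Is(unicode.Katakana, r)
}

func main() {
	// Walk the tail of the Katakana block, where the table has its gap.
	for r := rune(0x30fa); r <= 0x30ff; r++ {
		fmt.Printf("U+%04X %q in Katakana table: %v\n", r, r, inKatakanaScript(r))
	}
}
```

Going by the ranges quoted above, the check should fail only for U+30FB and U+30FC within this span.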

Rob Pike

Dec 10, 2019, 6:26:19 AM

to zhuoy...@gmail.com, golang-nuts

That's a good question and I haven't figured it out, but I bet it has to do with U+30fb not being in L class:

% unicode -d -U 30fa 30fb 30fc
U+30FA 'ヺ' KATAKANA LETTER VO
    category: Lo
    canonical combining classes: 0
    bidirectional category: L
    character decomposition mapping: 30F2 3099
    mirrored: N
U+30FB '・' KATAKANA MIDDLE DOT
    category: Po
    canonical combining classes: 0
    bidirectional category: ON
    mirrored: N
U+30FC 'ー' KATAKANA-HIRAGANA PROLONGED SOUND MARK
    category: Lm
    canonical combining classes: 0
    bidirectional category: L
    mirrored: N
%

I think the behavior might be a bug, but the character is peculiar, or at least punctuation rather than "letters". Leaving for Marcel, who is the curator of the Unicode packages these days.

-rob
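
The same categories can be checked from Go itself. This is only a sketch: the helper categoryOf is hypothetical and looks at just the three categories that show up in the output above, using the standard unicode.Lo, unicode.Lm and unicode.Po tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// categoryOf is a hypothetical helper for this thread. It only
// distinguishes the three general categories seen in the output above.
func categoryOf(r rune) string {
	switch {
	case unicode.Is(unicode.Lo, r):
		return "Lo" // Letter, other
	case unicode.Is(unicode.Lm, r):
		return "Lm" // Letter, modifier
	case unicode.Is(unicode.Po, r):
		return "Po" // Punctuation, other
	}
	return "other"
}

func main() {
	for _, r := range []rune{0x30fa, 0x30fb, 0x30fc} {
		fmt.Printf("U+%04X %q category=%s letter=%v punct=%v\n",
			r, r, categoryOf(r), unicode.IsLetter(r), unicode.IsPunct(r))
	}
}
```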


zhuoy...@gmail.com

Dec 10, 2019, 7:18:39 AM

to golang-nuts

It was a big surprise to get an answer directly from Rob Pike.

Thanks a lot!

My colleagues and I are very curious about this question.

We hope to get a reply from Marcel.

Ben Bullock

Dec 10, 2019, 7:19:52 PM

to golang-nuts

These properties come from the Unicode definitions in the file Scripts.txt, not from the Go language. It is the same in Perl: \p{Katakana} does not match U+30FB or U+30FC, but \p{InKatakana} does. The same goes for U+30A0.

Here is the relevant portion of Scripts.txt:

30A1..30FA ; Katakana # Lo [90] KATAKANA LETTER SMALL A..KATAKANA LETTER VO
30FD..30FE ; Katakana # Lm [2] KATAKANA ITERATION MARK..KATAKANA VOICED ITERATION MARK
30FF ; Katakana # Lo KATAKANA DIGRAPH KOTO
31F0..31FF ; Katakana # Lo [16] KATAKANA LETTER SMALL KU..KATAKANA LETTER SMALL RO
32D0..32FE ; Katakana # So [47] CIRCLED KATAKANA A..CIRCLED KATAKANA WO
3300..3357 ; Katakana # So [88] SQUARE APAATO..SQUARE WATTO
FF66..FF6F ; Katakana # Lo [10] HALFWIDTH KATAKANA LETTER WO..HALFWIDTH KATAKANA LETTER SMALL TU
FF71..FF9D ; Katakana # Lo [45] HALFWIDTH KATAKANA LETTER A..HALFWIDTH KATAKANA LETTER N
1B000 ; Katakana # Lo KATAKANA LETTER ARCHAIC E
1B164..1B167 ; Katakana # Lo [4] KATAKANA LETTER SMALL WI..KATAKANA LETTER SMALL N
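
To make the script-versus-block distinction concrete in Go, here is a rough sketch. As far as I know the standard unicode package ships script tables but no block tables, so katakanaBlock below is a hand-written table covering the Katakana block U+30A0..U+30FF; it and the three helper functions are assumptions made up for this example. unicode.Common is the standard table for the Common script, which is where Scripts.txt places U+30A0, U+30FB and U+30FC.

```go
package main

import (
	"fmt"
	"unicode"
)

// katakanaBlock is a hand-written table for the Katakana block
// (U+30A0..U+30FF). It is not provided by the unicode package.
var katakanaBlock = &unicode.RangeTable{
	R16: []unicode.Range16{{Lo: 0x30a0, Hi: 0x30ff, Stride: 1}},
}

// inKatakanaBlock roughly corresponds to Perl's \p{InKatakana}.
func inKatakanaBlock(r rune) bool { return unicode.Is(katakanaBlock, r) }

// inKatakanaScript roughly corresponds to Perl's \p{Katakana}.
func inKatakanaScript(r rune) bool { return unicode.Is(unicode.Katakana, r) }

// inCommonScript reports whether r belongs to the Common script.
func inCommonScript(r rune) bool { return unicode.Is(unicode.Common, r) }

func main() {
	for _, r := range []rune{0x30a0, 0x30a1, 0x30fb, 0x30fc} {
		fmt.Printf("U+%04X block=%v script=%v common=%v\n",
			r, inKatakanaBlock(r), inKatakanaScript(r), inCommonScript(r))
	}
}
```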

I imagine the reason is that U+30FC, KATAKANA-HIRAGANA PROLONGED SOUND MARK, isn't specifically a katakana symbol: it can be used with either katakana or hiragana (らーめん etc.). U+30FB, although it's called KATAKANA MIDDLE DOT, is a punctuation mark and not actually a katakana symbol either.

But the Unicode definitions are not easy to work with for people handling Japanese text. Generally speaking, if you want to match a Japanese word you want to get U+30FC, but you don't want U+30FB, which is why I made something like this:

https://metacpan.org/pod/Lingua::JA::Moji#InKana
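
In Go, the same idea could be sketched roughly like this. To be clear, this is not a port of the Perl module and does not try to reproduce its exact definition; isKanaWordRune and isKanaWord are hypothetical names, and the rule (Katakana script, Hiragana script, plus U+30FC, but not U+30FB) is just my reading of the paragraph above.

```go
package main

import (
	"fmt"
	"unicode"
)

// prolongedSoundMark is U+30FC, shared by katakana and hiragana words.
const prolongedSoundMark = '\u30fc'

// isKanaWordRune is a hypothetical helper: it accepts runes in the
// Katakana or Hiragana scripts, plus U+30FC. U+30FB (the middle dot)
// is in neither script table, so it is rejected.
func isKanaWordRune(r rune) bool {
	if r == prolongedSoundMark {
		return true
	}
	return unicode.In(r, unicode.Katakana, unicode.Hiragana)
}

// isKanaWord reports whether s is non-empty and made up only of runes
// accepted by isKanaWordRune.
func isKanaWord(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !isKanaWordRune(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"ら\u30fcめん", "ラ\u30fcメン", "コカ\u30fbコ\u30fcラ"} {
		fmt.Printf("%s kana word: %v\n", s, isKanaWord(s))
	}
}
```

One caveat: unicode.Katakana also contains the circled and square katakana symbols (U+32D0..U+32FE, U+3300..U+3357), so a stricter matcher might want to exclude those ranges.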
