Why doesn't Go's unicode Katakana table include 0x30fc?


zhuoy...@gmail.com

Dec 10, 2019, 3:22:38 AM
to golang-nuts

Rob Pike

Dec 10, 2019, 6:26:19 AM
to zhuoy...@gmail.com, golang-nuts
That's a good question and I haven't figured it out, but I bet it has to do with U+30fb not being in L class:

% unicode -d -U 30fa 30fb 30fc
U+30FA 'ヺ' KATAKANA LETTER VO
category: Lo
canonical combining classes: 0
bidirectional category: L
character decomposition mapping: 30F2 3099
mirrored: N

U+30FB '・' KATAKANA MIDDLE DOT
category: Po
canonical combining classes: 0
bidirectional category: ON
mirrored: N

U+30FC 'ー' KATAKANA-HIRAGANA PROLONGED SOUND MARK
category: Lm
canonical combining classes: 0
bidirectional category: L
mirrored: N
%


I think the behavior might be a bug, but the character is peculiar, or at least punctuation rather than "letters". Leaving for Marcel, who is the curator of the Unicode packages these days.


-rob





zhuoy...@gmail.com

Dec 10, 2019, 7:18:39 AM
to golang-nuts
It is a big surprise for me to get the answer directly from Rob Pike.
Thanks a lot!
My colleagues and I are very curious about this question.
I hope to get a reply from Marcel.


Ben Bullock

Dec 10, 2019, 7:19:52 PM
to golang-nuts
These properties come from the Unicode definitions in the file Scripts.txt, not from the Go language. It is the same in Perl: \p{Katakana} does not match U+30FB or U+30FC, but \p{InKatakana} does. The same applies to U+30A0.

Here is the relevant portion of Scripts.txt:

30A1..30FA    ; Katakana # Lo  [90] KATAKANA LETTER SMALL A..KATAKANA LETTER VO
30FD..30FE    ; Katakana # Lm   [2] KATAKANA ITERATION MARK..KATAKANA VOICED ITERATION MARK
30FF          ; Katakana # Lo       KATAKANA DIGRAPH KOTO
31F0..31FF    ; Katakana # Lo  [16] KATAKANA LETTER SMALL KU..KATAKANA LETTER SMALL RO
32D0..32FE    ; Katakana # So  [47] CIRCLED KATAKANA A..CIRCLED KATAKANA WO
3300..3357    ; Katakana # So  [88] SQUARE APAATO..SQUARE WATTO
FF66..FF6F    ; Katakana # Lo  [10] HALFWIDTH KATAKANA LETTER WO..HALFWIDTH KATAKANA LETTER SMALL TU
FF71..FF9D    ; Katakana # Lo  [45] HALFWIDTH KATAKANA LETTER A..HALFWIDTH KATAKANA LETTER N
1B000         ; Katakana # Lo       KATAKANA LETTER ARCHAIC E
1B164..1B167  ; Katakana # Lo   [4] KATAKANA LETTER SMALL WI..KATAKANA LETTER SMALL N


I imagine that the reason for this is that U+30FC, the KATAKANA-HIRAGANA PROLONGED SOUND MARK, isn't specifically a katakana symbol: it can be used with either katakana or hiragana (らーめん etc.). U+30FB, although it's called KATAKANA MIDDLE DOT, is a punctuation mark and is not a katakana symbol either.

But the Unicode definitions are not easy to work with for people handling Japanese text. Generally speaking, if you want to match a Japanese word you want to get U+30FC, but you don't want U+30FB, which is why I made something like this:
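As a separate illustration of that requirement in Go (a rough sketch, not the code referred to above), one way might be to accept the Katakana and Hiragana scripts plus U+30FC explicitly. The names isKanaWordRune and isKanaWord are invented for this example. Note that unicode.Katakana also covers the circled and squared katakana symbols listed in Scripts.txt, which a real matcher may want to exclude.

```go
package main

import (
	"fmt"
	"unicode"
)

// prolongedSoundMark is U+30FC, which Unicode does not list under the
// Katakana script, so it is allowed explicitly here.
const prolongedSoundMark = '\u30FC'

// isKanaWordRune (hypothetical name) reports whether r may appear inside
// a kana word: any Katakana or Hiragana script rune, or U+30FC.
// U+30FB (the middle dot) is in neither script, so it is rejected.
func isKanaWordRune(r rune) bool {
	return r == prolongedSoundMark ||
		unicode.In(r, unicode.Katakana, unicode.Hiragana)
}

// isKanaWord (hypothetical name) reports whether s is non-empty and
// consists only of runes accepted by isKanaWordRune.
func isKanaWord(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if !isKanaWordRune(r) {
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"らーめん", "ラーメン", "コーヒー・ミルク"} {
		fmt.Printf("%s: %v\n", s, isKanaWord(s))
	}
}
```

With this sketch the first two strings are accepted and the third is rejected because it contains the middle dot. Kanji could be allowed as well by adding unicode.Han to the unicode.In call.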


