Match CJK characters (Chinese-Japanese-Korean)

ccl...@dieselpoint.com

Aug 23, 2017, 8:59:36 PM
to antlr-discussion
I'd like to match words that consist of Unicode letters, except for CJK characters. CJK characters should get recognized separately and individually. A Unicode letter is any character where the Java Character.isLetter(ch) is true. Two questions:

1. How do I recognize CJK? In JFlex, I did it using numeric character ranges. I'd prefer to use the named ranges supported by Antlr. My JFlex code:

cjk=[\u4E00-\u9FCF]|[\u3400-\u4DBF]|[\uF900-\uFAFF]|[\u3190-\u319F]|[\u2E80-\u2EFF]|[\u2F00-\u2FdF]|[\u31C0-\u31EF]|[\u3100-\u312F]|[\u31A0-\u31BF]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\u31F0-\u31FF]|[\uAC00-\uD7AF]|[\u1100-\u11FF]|[\u3130-\u318F]|[\uA000-\uA48F]|[\uA490-\uA4CF]|[\uFF65-\uFF9F]|[\uFFA0-\uFFDC]

2. How do I recognize every character which is a letter and not CJK? In JFlex, I did this:

/* this translates to ([:letter:] and (not {cjk})) */
letter=!(![:letter:] | {cjk})

This syntax doesn't work in Antlr. I know that I can use this to recognize a letter, but I don't know how to exclude CJK.

LETTER :    [\p{L}];


Mike Lischke

Aug 24, 2017, 3:57:18 AM
to antlr-di...@googlegroups.com

> I'd like to match words that consist of Unicode letters, except for CJK characters. CJK characters should get recognized separately and individually. A Unicode letter is any character where the Java Character.isLetter(ch) is true. Two questions:
>
> 1. How do I recognize CJK? In JFlex, I did it using numeric character ranges. I'd prefer to use the named ranges supported by Antlr. My JFlex code:
>
> cjk=[\u4E00-\u9FCF]|[\u3400-\u4DBF]|[\uF900-\uFAFF]|[\u3190-\u319F]|[\u2E80-\u2EFF]|[\u2F00-\u2FdF]|[\u31C0-\u31EF]|[\u3100-\u312F]|[\u31A0-\u31BF]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\u31F0-\u31FF]|[\uAC00-\uD7AF]|[\u1100-\u11FF]|[\u3130-\u318F]|[\uA000-\uA48F]|[\uA490-\uA4CF]|[\uFF65-\uFF9F]|[\uFFA0-\uFFDC]

You can still do that in ANTLR4, but it's cumbersome.

>
> 2. How do I recognize every character which is a letter and not CJK? In JFlex, I did this:
>
> /* this translates to ([:letter:] and (not {cjk})) */
> letter=!(![:letter:] | {cjk})
>
> This syntax doesn't work in Antlr. I know that I can use this to recognize a letter, but I don't know how to exclude CJK.
>
> LETTER : [\p{L}];

Hmm, you could use [\P{CJK_Unified_Ideographs}], but that gives you *all* code points outside that block, including non-letter characters. ANTLR has no operator for intersecting or subtracting character sets; you only get alternation (the | operator, which could be read as OR) and negation (~).

But you could define 2 rules such that rule order kicks in and does the right thing for you:

CJK: [\p{CJK_Unified_Ideographs}\p{CJK_Compatibility} ... etc.];
LETTER : [\p{L}];

Since CJK comes first, it wins over LETTER whenever both rules match the same CJK code point.

Mike
--
www.soft-gems.net

ccl...@dieselpoint.com

Aug 24, 2017, 1:30:32 PM
to antlr-discussion
Unfortunately, the CJK-before-LETTER trick only works for single characters, not for words. If I do this:

CJK :  [\p{InCJK_Unified_Ideographs}];
LETTER :    [\p{L}];
WORD : LETTER+;

and then lex a sequence of CJK characters, they will get recognized as a WORD, not as individual CJK characters. Somehow I need to define LETTER in such a way that it excludes CJK.

Mike Lischke

Aug 25, 2017, 3:19:39 AM
to antlr-di...@googlegroups.com
> Unfortunately, the CJK-before-LETTER trick only works for single characters, not for words. If I do this:
>
> CJK : [\p{InCJK_Unified_Ideographs}];
> LETTER : [\p{L}];
> WORD : LETTER+;
>
> and then lex a sequence of CJK characters, they will get recognized as a WORD, not as individual CJK characters. Somehow I need to define LETTER in such a way that it excludes CJK.


That's a bit surprising, but maybe the "longest match wins" rule kicks in here: WORD matches more input than CJK, so it takes precedence, and rule order only breaks ties between matches of equal length.
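
To illustrate the longest-match behavior outside ANTLR, here is a small Java simulation built on java.util.regex. It is only a sketch and not how ANTLR is implemented: the class name `LongestMatchDemo`, the two stand-in patterns (`\p{IsHan}` for CJK, `\p{L}+` for WORD) and the `NAME:text` output format are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Hypothetical stand-ins for the two lexer rules, in grammar order.
    static final String[] NAMES = {"CJK", "WORD"};
    static final Pattern[] RULES = {
        Pattern.compile("\\p{IsHan}"),  // one Han code point
        Pattern.compile("\\p{L}+")      // one or more letters of any script
    };

    // Tokenizes by trying every rule at the current position and keeping the
    // longest match; on a tie the earlier rule wins.
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            int best = -1;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input);
                m.region(pos, input.length());
                // Strict '>' keeps the earlier rule when lengths are equal.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    best = i;
                }
            }
            if (best < 0) {
                // No rule matches here: skip one code point.
                pos += Character.charCount(input.codePointAt(pos));
                continue;
            }
            out.add(NAMES[best] + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return out;
    }
}
```

With these rules, a lone Han character comes out as a CJK token (a tie, so the first rule wins), while two adjacent Han characters come out as a single WORD token because the WORD pattern consumes both.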

Another approach could be to use a predicate:

LETTER: [\p{L}] {isNotCJK(getText())}?;

where `isNotCJK` is a function on your lexer (LETTER is a lexer rule, so the predicate runs in the lexer, not the parser) which does the CJK check and returns true for non-CJK letters. Instead of `getText()` you can also try $text (which is more language neutral, if that matters given that we have that function call).
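
A minimal sketch of what such a check might look like in Java. The thread does not define `isNotCJK`, so everything here is an assumption: the class name `CjkCheck`, the choice of `Character.UnicodeScript` instead of the numeric block ranges from the JFlex code, and the set of scripts treated as CJK. It inspects only the last code point of the text it is given, because `getText()` in a lexer returns everything matched so far for the current token, which can be more than one character when LETTER is used inside `WORD : LETTER+`.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

class CjkCheck {
    // Assumption: these scripts roughly cover the ranges listed in the JFlex rule.
    private static final Set<UnicodeScript> CJK_SCRIPTS = EnumSet.of(
        UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA,
        UnicodeScript.HANGUL, UnicodeScript.BOPOMOFO, UnicodeScript.YI);

    // True if the code point belongs to one of the scripts above.
    static boolean isCJK(int codePoint) {
        return CJK_SCRIPTS.contains(UnicodeScript.of(codePoint));
    }

    // True if the last code point of the text is a letter and not CJK.
    static boolean isNotCJK(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        int last = text.codePointBefore(text.length());
        return Character.isLetter(last) && !isCJK(last);
    }
}
```

In a grammar, the two static methods could be copied into a `@lexer::members` block so the predicate can call `isNotCJK` directly. Script membership and block membership are not identical, so the accepted set differs slightly from the original ranges.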

Mike
--
www.soft-gems.net
