> I'd like to match words that consist of Unicode letters, except for CJK characters. CJK characters should get recognized separately and individually. A Unicode letter is any character where the Java Character.isLetter(ch) is true. Two questions:
>
> 1. How do I recognize CJK? In JFlex, I did it using numeric character ranges. I'd prefer to use the named ranges supported by Antlr. My JFlex code:
>
> cjk=[\u4E00-\u9FCF]|[\u3400-\u4DBF]|[\uF900-\uFAFF]|[\u3190-\u319F]|[\u2E80-\u2EFF]|[\u2F00-\u2FdF]|[\u31C0-\u31EF]|[\u3100-\u312F]|[\u31A0-\u31BF]|[\u3040-\u309F]|[\u30A0-\u30FF]|[\u31F0-\u31FF]|[\uAC00-\uD7AF]|[\u1100-\u11FF]|[\u3130-\u318F]|[\uA000-\uA48F]|[\uA490-\uA4CF]|[\uFF65-\uFF9F]|[\uFFA0-\uFFDC]
You can still do that in ANTLR4, but it's cumbersome.
>
> 2. How do I recognize every character which is a letter and not CJK? In JFlex, I did this:
>
> /* this translates to ([:letter:] and (not {cjk})) */
> letter=!(![:letter:] | {cjk})
>
> This syntax doesn't work in Antlr. I know that I can use this to recognize a letter, but I don't know how to exclude CJK.
>
> LETTER : [\p{L}];
You could use [\P{CJK_Unified_Ideographs}], but that matches *everything* outside that block, including non-letter characters. ANTLR has no way to intersect or subtract character sets; the only set operators are | (alternation, which reads as OR) and ~ (negation of a whole set).
Instead, you can define two rules and let rule order resolve the overlap:
CJK: [\p{CJK_Unified_Ideographs}\p{CJK_Compatibility} ... etc.];
LETTER : [\p{L}];
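Both rules match exactly one code point, so for a CJK character the match lengths tie and ANTLR picks the rule defined first: CJK wins, and LETTER only ever sees the remaining letters. This relies on that tie; a longer rule such as LETTER+ would win by longest match and swallow runs of CJK characters.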
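
If you want to sanity-check what that priority should produce before wiring up the grammar, the same "CJK first, then letter" decision can be mimicked in plain Java. This is only a rough sketch, not ANTLR-generated code: the class name CjkOrLetter and its helpers are made up for illustration, and the block list is an assumed, partial subset of the ranges in the question, so extend it to whatever you actually count as CJK.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper that imitates the rule priority: CJK is tested before "letter".
class CjkOrLetter {
    // Assumption: a partial, illustrative set of CJK-related blocks.
    private static final Set<UnicodeBlock> CJK_BLOCKS = Set.of(
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS,
            UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A,
            UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS,
            UnicodeBlock.HIRAGANA,
            UnicodeBlock.KATAKANA,
            UnicodeBlock.HANGUL_SYLLABLES);

    static boolean isCjk(int codePoint) {
        UnicodeBlock block = UnicodeBlock.of(codePoint); // null for unassigned blocks
        return block != null && CJK_BLOCKS.contains(block);
    }

    // Splits text into words of non-CJK letters and single-character CJK tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isCjk(cp)) {                      // checked first, like the CJK rule
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetter(cp)) {  // any other letter extends the word
                word.appendCodePoint(cp);
            } else {                              // non-letters only separate tokens
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Latin word, three CJK ideographs, Latin word.
        System.out.println(tokenize("h\u00E9llo \u65E5\u672C\u8A9E world"));
    }
}
```

Comparing the output of tokenize against the token stream from your lexer on a few mixed strings is a cheap way to spot a block you forgot to list in the CJK rule.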
Mike
--
www.soft-gems.net