Word breaks in ideographs differ in UAX29 versus Unicode Utils?


Kip Cole

Feb 28, 2024, 8:43:33 PM
to CLDR Users Public Mail List
I’m finalising my personNames implementation for the Elixir language and some tests are failing in the `zh` locale when taking the initials of a name. The base issue is that Unicode Utils (https://util.unicode.org/UnicodeJsps/breaks.jsp) shows no word break in “德威” (i.e. it’s treated as one word) but it does find a word break in “东升” (treated as two words).
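To make the effect on initials concrete, here is a minimal Python sketch (not my Elixir code). It assumes initials are formed by taking the first character of each word segment, and it uses a plain code point where a real implementation would take a grapheme cluster. The function name `initials` is made up for illustration.

```python
def initials(words):
    """Return the first character of each non-empty word segment.

    Assumption: one code point stands in for one grapheme cluster,
    which holds for the simple ideographs used here.
    """
    return [word[0] for word in words if word]


# Segmentations as reported by the breaks.jsp tool described above.
one_word = ["德威"]        # no word break found
two_words = ["东", "升"]   # word break found between the ideographs

print(initials(one_word))   # ['德']
print(initials(two_words))  # ['东', '升']
```

So the same two-ideograph given name yields one initial or two depending purely on where the word segmenter places its breaks.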

As best I can tell, `common/segments/root.xml` is the CLDR source for the Unicode Text Segmentation algorithm (UAX #29), and I can’t see a rule that would suppress the break in “德威”. More perplexing, the `root.xml` content for word breaks says specifically:

<!-- Otherwise, break everywhere (including around ideographs). -->
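Read literally, that comment means every ideograph ends up as its own segment when no earlier rule applies. A rough Python sketch of just that fallback follows. It is not a UAX #29 implementation: it approximates "ideograph" by the character name, lumps everything else into undivided runs, and the function names are invented for the example.

```python
import unicodedata


def is_ideograph(ch):
    # Approximation: match on the character name rather than the
    # Unicode Ideographic property.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


def break_around_ideographs(text):
    """Split text so that each ideograph is a segment of its own.

    Non-ideographic characters are kept together in runs; real UAX #29
    rules would split those runs further.
    """
    segments = []
    run = ""
    for ch in text:
        if is_ideograph(ch):
            if run:
                segments.append(run)
                run = ""
            segments.append(ch)
        else:
            run += ch
    if run:
        segments.append(run)
    return segments


print(break_around_ideographs("德威"))  # ['德', '威']
print(break_around_ideographs("东升"))  # ['东', '升']
```

Under this reading both names split into two segments, which is why a single segment for “德威” looks anomalous.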

I’ve been trying to find other references to the CLDR implementation that explain this apparent “anomaly”. The only thing I can find so far is an ancient document at https://cldr.unicode.org/development/development-process/design-proposals/specifying-text-break-variants-in-locale-ids that hints that CLDR is doing some kind of dictionary lookup for CJK word breaks, but I can’t find that either.

How do I work out what rules CLDR is applying when word segmenting text, specifically Hans/Hant text?

Thanks for any help or pointers.

Markus Scherer

Feb 29, 2024, 2:40:32 PM
to Kip Cole, Robin Leroy, CLDR Users Public Mail List, Andy Heninger
1. Isn't there always a default UAX #29 word break between ideographs?
2. When using ICU, CJ+Thai word break uses a dictionary in addition to the rules.
3. CLDR/ICU have some tailorings from vanilla UAX #29.
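As a toy illustration of point 2, the Python sketch below combines a dictionary lookup with the per-ideograph fallback. It uses greedy longest-match, which is a simplification and not a description of ICU's actual algorithm. The dictionary entry is hypothetical and not taken from ICU's data, and the function name is made up.

```python
def segment_with_dictionary(text, dictionary):
    """Greedy longest-match segmentation with a one-character fallback."""
    max_len = max((len(word) for word in dictionary), default=1)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                segments.append(candidate)
                i += size
                break
    return segments


toy_dictionary = {"德威"}  # hypothetical entry, for illustration only

print(segment_with_dictionary("德威", toy_dictionary))  # ['德威']
print(segment_with_dictionary("东升", toy_dictionary))  # ['东', '升']
```

A string found in the dictionary stays together as one word, while one that is absent falls back to a break between each ideograph, which would match the kind of difference reported for the two names.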

Mark & Robin can say more...

markus

Kip Cole

Feb 29, 2024, 2:50:46 PM
to Markus Scherer, Robin Leroy, CLDR Users Public Mail List, Andy Heninger
Thank you Markus.

1. Yes, UAX #29 breaks between ideographs (hence the quote from root.xml below).
2. Thanks very much. I did eventually find the dictionaries, which are hosted in the icu4c code. It turns out there are dictionaries for CJ, Thai, Burmese, Khmer and Lao. Very helpful.
3. I did see the differences between vanilla UAX #29 and CLDR, primarily that the definition of $MidLetter is different in root.xml.

Thanks again, —Kip