Intent to Prototype: Gecko's segmenter rules match with Unicode standard (UAX#14 and UAX#29)
Makoto Kato
Aug 1, 2023, 7:21:32 AM
to dev-pl...@mozilla.org
TLDR
We are replacing Gecko’s segmenter code with the ICU4X [*1] segmenter,
which is compatible with UAX#14 [*2] and UAX#29 [*3].
Gecko's line/word segmenter was designed before 2000 and is some of
the oldest code in Gecko. After it was written, the Unicode Consortium
published "UAX#14 - Unicode Line Breaking Algorithm" and "UAX#29 -
Unicode Text Segmentation", standards whose segmentation rules cover
many languages. Unfortunately, Gecko’s segmentation isn’t compatible
with these standards. Other browser engines (WebKit and Blink) use
ICU4C, whose segmenter rules are compatible with these standards, so
this is a web compatibility issue.
Amazon, Google and Mozilla are now working on ICU4X, a set of Rust
crates for internationalization (i18n). Specifically, Ting-Yu Lin and
I are working on a new segmenter crate in ICU4X. We have decided to
use ICU4X for the new segmenter implementation in Gecko, which makes
this the first integration of ICU4X into Gecko.