Intent to Prototype: Gecko's segmenter rules match with Unicode standard (UAX#14 and UAX#29)

Makoto Kato

Aug 1, 2023, 7:21:32 AM
to dev-pl...@mozilla.org
TL;DR
We are replacing Gecko’s segmenter code with the ICU4X [*1] segmenter,
which is compatible with UAX#14 [*2] and UAX#29 [*3].

Gecko's line/word segmenter was designed before 2000 and is some of the
oldest code in Gecko. After it was written, the Unicode Consortium
published "UAX#14 - Unicode Line Breaking Algorithm" and "UAX#29 -
Unicode Text Segmentation", standards for segmentation rules that
cover many languages. Unfortunately, Gecko’s segmentation isn’t
compatible with these standards. Other browser engines (WebKit and
Blink) use ICU4C, whose segmenter rules are compatible with them, so
this is a web compatibility issue.

Amazon, Google and Mozilla are now working on ICU4X, a set of Rust
crates for I18N. Specifically, Ting-Yu Lin and I are working on a new
segmenter crate in ICU4X. We have decided to use ICU4X for the new
segmenter implementation in Gecko, which makes this the first
integration of the ICU4X project into Gecko.

Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1719535

Specification: https://www.unicode.org/reports/tr14/ and
https://www.unicode.org/reports/tr29/

Standards Body: The Unicode Consortium

Platform coverage: All

Preference: intl.icu4x.segmenter.enabled

DevTools bug: N/A

Other Browsers: shipped

web-platform-tests:
https://wpt.fyi/results/css/css-text/line-breaking,
https://wpt.fyi/results/css/css-text/i18n

-- Makoto Kato / :m_kato

*1 https://github.com/unicode-org/icu4x/
*2 https://www.unicode.org/reports/tr14/
*3 https://www.unicode.org/reports/tr29/