How much do we care about libxul size vs. niche correctness?

Henri Sivonen

Apr 3, 2023, 11:01:00 AM

to dev-platform

Do we care enough about libxul size to take an easy-to-write patch that would shrink libxul by 152 KB at the cost of slightly reducing Web-exposed correctness? The reduction in correctness exists in principle but, judging from the HTTP Archive data set, is most likely not user-relevant.

Intl.Collator(..., {usage: "search"}) is an API for fuzzy filtering of a set of strings by a search key that is expected to have come from user input. (In the context of the Latin script, "fuzzy" means case-insensitive and diacritic-insensitive.) The API allows full-string matching only (not substring search), so it can only be used to filter a set of strings by testing each one against the search key. That makes the addressable use cases pretty narrow. Based on HTTP Archive, almost all instances of calling code on the Web are traceable to a single PR: https://github.com/mapbox/mapbox-gl-js/pull/6270 .
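For concreteness, here is a minimal sketch of the kind of filtering this API supports. The helper name filterBySearchKey and the sample strings are made up for illustration, and the sketch passes sensitivity: "base" explicitly rather than relying on whatever default sensitivity an engine picks for usage: "search".

```typescript
// Hypothetical helper (not from any real code base): keep only the
// candidates that match the search key as whole strings.
function filterBySearchKey(candidates: string[], key: string, locale: string): string[] {
  // Assumption of this sketch: sensitivity "base" is requested explicitly so
  // that case and diacritic differences are ignored.
  const collator = new Intl.Collator(locale, { usage: "search", sensitivity: "base" });
  // compare() returns 0 only when the two full strings match; there is no
  // substring search, so every candidate has to be tested one by one.
  return candidates.filter((candidate) => collator.compare(candidate, key) === 0);
}

const candidates: string[] = ["Résumé", "resume", "Resumes", "résumé"];
const matches: string[] = filterBySearchKey(candidates, "resume", "en");
console.log(matches); // "Resumes" is excluded: only full-string matches count
```

Note that the only available operation is "does this whole string match the key", which is why the API ends up being used as a filter over a list rather than as a find-in-text facility.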

Some Latin-script languages[1] have language-specific exceptions to diacritic-insensitivity. There are also script-level fuzziness rules for the Arabic script (to be insensitive to certain Arabic marks) and the Thai script (to be insensitive to phinthu/virama). (There is one for Hangul, too, but that's a different story; see my next email.) Additionally, there is a conceptually "script-level" rule for symbols that makes the not-equals sign _not_ match the equals sign in diacritic-insensitive matching.
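To see how these rules surface to script authors, a small probe along the following lines can be run in a browser console or in Node. The helper name matchesInSearch and the chosen string pairs are illustrative assumptions. The results for the language-specific and symbol cases depend on the collation data the engine ships, so the sketch deliberately prints them instead of stating expected values.

```typescript
// Hypothetical probe: do two strings match in base-sensitivity search
// collation for a given locale?
function matchesInSearch(locale: string, a: string, b: string): boolean {
  const collator = new Intl.Collator(locale, { usage: "search", sensitivity: "base" });
  return collator.compare(a, b) === 0;
}

// Each entry is [locale, first string, second string].
const probes: Array<[string, string, string]> = [
  ["en", "a", "ä"], // English has no language-specific exception
  ["de", "a", "ä"], // German is on the list of languages with exceptions
  ["sv", "a", "ä"], // Swedish is on the list as well
  ["en", "=", "≠"], // the symbol rule discussed above
];

for (const [locale, a, b] of probes) {
  // The output depends on the engine's collation data, so no expected
  // values are given here.
  console.log(locale, JSON.stringify(a), JSON.stringify(b), matchesInSearch(locale, a, b));
}
```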

The language-specific rules and the script-level rules combine really badly in terms of data size. What was originally supposed to be "free" reuse of sorting data ends up growing libxul by 152 KB.

We could make libxul 152 KB smaller by not having the Arabic and Thai-script fuzziness rules (and the not-equals _un_fuzziness rule) apply when language-specific Latin-script rules are in effect.

If we were to do this, it would apply to all platforms (not just Android) so as not to add build system complications.

Eventually, we could have a size reduction without a Web-exposed behavior change by migrating to ICU4X *and* implementing https://github.com/unicode-org/icu4x/issues/3178 on the ICU4X side.

[1]:
* Azeri
* Catalan
* Danish
* Faroese
* Finnish
* German
* Greenlandic
* Hungarian
* Icelandic
* Inari Sámi
* North Sámi
* Norwegian
* Slovak
* Spanish
* Swedish
* Turkish

Notably, English (the common fallback language) and French (used in a number of places where Arabic is also used) are _not_ on this list and, therefore, would still get the Arabic-script and Thai-script fuzziness.

--

Henri Sivonen
hsiv...@mozilla.com
