Intent to unship: Korean search collations

Henri Sivonen

Apr 3, 2023, 11:01:08 AM
to dev-platform
Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1739983

(See my previous email about how Intl.Collator(..., {usage: "search"}) is little used and limited in terms of use cases due to doing only full-string matching, which mainly enables filtering a set of strings by a search key rather than finding the search key as a substring or prefix of a longer string.)

For somewhat unfortunate reasons that are my fault, ICU4X doesn't properly support search collation tailoring for Hangul: In ICU4X search collations, Hangul uses sorting data, which means that to match a Hangul string, the search key has to be canonically equivalent. Hangul is special, because collation builds on normalization and Hangul is special in normalization. (FWIW, this is a looser condition than what Firefox's ctrl-f/cmd-f imposes! Firefox's ctrl-f/cmd-f requires the normalization to match as well. That is, since input methods produce Hangul in Normalization Form C, Firefox requires the page to be in NFC, too, for ctrl-f/cmd-f to work.)

I'd like to unship Korean search collations in the ICU4C context to see if implementing support for them in the ICU4X context needs to be treated as a blocker for Gecko switching its Intl.Collator back end from ICU4C to ICU4X. Given the Web API surface available (full-string matching only, no prefix or substring matching), the utility of the Korean search collations seems questionable to me, which is why it doesn't seem like a good use of engineering effort to make ICU4X support them if not supporting them turns out to be feasible.

Additionally, this reduces libxul size on aarch64 Android by 200 KB on top of the 152 KB reduction from the change contemplated in my previous email. That is, a 352 KB reduction taken together.

The patch is easy to write. If others think this is acceptable to do, I intend to pursue getting this change landed in August.


# Details

## What CLDR has

In CLDR, there are three search-specific special behaviors for Hangul:

1. There exists a Korean-specific special search mode called searchjl. It matches on the lead consonant of each syllable and ignores the vowel and the possible trailing consonant of each syllable. As I understand it, this mode originates from contact name search on pre-iOS/Android phones.

2. The plain search mode when the Korean language is requested allows matching archaic Hangul with an ill-formed approximation written with a modern-Hangul-only input method.

3. The search root contains data analogous to the previous item, but for modern Hangul only. It allows well-formed modern Hangul to be matched by ill-formed input in which double letters have been typed by pressing the corresponding key twice without the shift key instead of being typed normally by pressing the corresponding key once with the shift key held down.

## Why they seem questionable

### Item 1, searchjl

For the use case of quickly filtering an address book view, it makes sense to try matching the needle as a _prefix_ of each name in the address book. However, the Web API only supports full-string matching, which makes the use case implausible in the context of the available API.

Furthermore, there were _zero_ uses of searchjl in the HTTP Archive data set.

However, it's unclear how well HTTP Archive covers the Korean-language part of the Web. Also, one would expect address-book filtering to sit behind a login, and HTTP Archive crawls the public Web. That said, many login-requiring sites serve their JavaScript bundle on the login page itself, so HTTP Archive does pick up JavaScript code that only activates behind login.

### Item 2, matching archaic Hangul with an ill-formed approximation written with a modern-only input method when the Korean language is requested

As "archaic" suggests, archaic Hangul is not used for present-day Korean text and is relevant mainly to scholarly use. For it to matter in the context of a Web API that only allows full-string matching, there would need to be a set of archaic Hangul strings to be filtered by a search key typed with a modern-Hangul-only input method. This is less plausible as a use case than being able to ctrl-f/cmd-f over a digitized historical document in apps that use the data for substring search (i.e. Chrome and Safari UI but _not_ the Web API), which I understand to motivate the existence of the data in CLDR.

Arguably, it's a layering violation for the search data to address input method concerns like this, but then one might argue that diacritic-insensitive search for the Latin script is about addressing an input method concern, too. AFAICT, Windows comes with an input method for archaic Hangul but other popular operating systems do not.

### Item 3, matching well-formed modern Hangul with an ill-formed approximation typed without the shift key in the context of all languages

This feature makes no sense to me. I can't infer a legitimate use case, and no one has told me one when I've asked. My inference is that this data exists in the first place for completeness, so that the way of approximating archaic Hangul also works for syllables that remain in modern Hangul. The archaic Hangul data is logically script-level data, but it is placed in a language-specific tailoring, which does not make sense as a matter of principle. My inference is that putting all the Hangul data in the search root would have made the other search collations, each of which contains a _copy_ of the search root, too large. As a compromise, the principle of putting script-level things in the search root was violated: the modern Hangul data was left in the search root and the archaic Hangul data was pushed into the Korean tailoring, even though the data left in the search root is just weird on its own. (I haven't been able to get access to the minutes of the meeting from over a decade ago where the split was decided.)

If my inference is incorrect and/or if there is a legitimate modern-Hangul-related use case for the Hangul data in the search root, please let me know.


## Alternatives

An approximation of searchjl would fit ICU4X, so an alternative would be to modify searchjl instead of removing it. Specifically, by omitting the lines https://github.com/unicode-org/icu/blob/64b35481263ac4df37a28a9c549553ecc9710db2/icu4c/source/data/coll/ko.txt#L369-L379, searchjl would fit in the ICU4X data format. It's unclear to me what the user-visible purpose of those lines is. (If I'm reading those lines correctly, the patterns don't occur in well-formed modern Hangul text and don't occur when typing a consonant-only sequence using an IME.)


## Standards

I've been told that customizing the collation data is permitted by relevant standards. An anything-goes position in standards isn't particularly interesting for the purpose of deciding whether it's a good idea to do this, though.


## Test suites

Test suites that I'm aware of don't test this and, therefore, wouldn't fail.


## Other browsers

I don't expect Chrome or Safari to do this. I gather that the capability to search archaic Hangul with a modern-only input method comes from work on Chrome's ctrl-f/cmd-f feature from the time before the Blink fork from WebKit. Furthermore, WebKit on Apple operating systems uses the system copy of ICU4C. (Firefox's ctrl-f/cmd-f isn't collator-based and doesn't allow for searching archaic Hangul with a modern-only input method.)


## Platforms

All.

--
Henri Sivonen
hsiv...@mozilla.com