What use case do Hangul tailorings in the CLDR search root address?

Henri Sivonen

Feb 15, 2023, 11:03:06 AM
to cldr-...@unicode.org, jshi...@gmail.com
Hi,

Both in the interest of optimizing collation data size for ICU4X and in the interest of optimizing Firefox's binary size, I'm trying to understand the Korean search collation tailorings in CLDR.

It appears that the purpose of ko-u-co-search is to allow searching a haystack that contains archaic Hangul with a needle typed using a modern-only IME. Why is working around text-input constraints on the needle considered to belong in the CLDR search collation data rather than in the text entry layer?

The Hangul part of und-u-co-search seems to apply to modern Hangul the same pattern that ko-u-co-search applies to archaic Hangul. What use case does this address? As far as I can tell, ctrl/cmd-f in Chrome and Safari still requires a match to cover whole syllables in the haystack, so the end result is not that Hangul gets treated as an alphabet in which arbitrary alphabetic substrings match. What the search collation does seem to accomplish is letting the user skip the usual shift key press: the ill-formed Hangul that results from pressing a consonant key twice finds what shift-pressing the key once would find. Why is that valuable?
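To make the doubled-consonant case concrete, here is a minimal Python sketch of the kind of equivalence described above. It is not the CLDR data or ICU's algorithm: the folding table and the function names are hypothetical, and it assumes the needle arrives as conjoining jamo, which a real IME may not produce.

```python
import unicodedata

# Hypothetical folding table (an assumption for illustration, not CLDR data):
# a simple lead jamo typed twice is treated like the modern compound lead jamo.
DOUBLED_LEAD_TO_COMPOUND = {
    "\u1100\u1100": "\u1101",  # KIYEOK + KIYEOK -> SSANGKIYEOK
    "\u1103\u1103": "\u1104",  # TIKEUT + TIKEUT -> SSANGTIKEUT
    "\u1107\u1107": "\u1108",  # PIEUP + PIEUP   -> SSANGPIEUP
    "\u1109\u1109": "\u110A",  # SIOS + SIOS     -> SSANGSIOS
    "\u110C\u110C": "\u110D",  # CIEUC + CIEUC   -> SSANGCIEUC
}


def fold_doubled_leads(text: str) -> str:
    """Decompose syllables to jamo (NFD), then fold doubled leads."""
    decomposed = unicodedata.normalize("NFD", text)
    for doubled, compound in DOUBLED_LEAD_TO_COMPOUND.items():
        decomposed = decomposed.replace(doubled, compound)
    return decomposed


def search_equal(needle: str, haystack_item: str) -> bool:
    """Whole-string comparison under the hypothetical folding."""
    return fold_doubled_leads(needle) == fold_doubled_leads(haystack_item)


# Ill-formed needle: lead KIYEOK twice, then vowel A (no shift key used).
needle_without_shift = "\u1100\u1100\u1161"
# Well-formed haystack syllable U+AE4C (SSANGKIYEOK + A).
haystack_syllable = "\uAE4C"

print(search_equal(needle_without_shift, haystack_syllable))  # True
print(search_equal("\u1100\u1161", haystack_syllable))        # False
```

The second call shows that a single plain consonant still does not match the compound one, so the folding only affects the doubled-key input.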

macOS Finder seems to require the needle to be the prefix of a word, so even a search that filters a set of strings (file names) could end up with the same result using the root sorting collation, without the search tailoring, as long as the user uses the shift key in the usual way.

Also, the notion that one should be able to search for modern compound lead jamo by entering the non-compound form twice appears in searchjl as well:
Why?

(Also, what do the lines https://github.com/unicode-org/cldr/blob/62d07d14371bf88f13644442178d47e001129811/common/collation/ko.xml#L878-L882 do? It seems to me that the comment above explains it, but I have a bit of trouble understanding what the explanation is saying.)

Finally, considering that ECMA-402 doesn't support substring search but only allows "searching" in the sense of the haystack being a list of strings where the entire needle is matched against an entire item from the list, is searchjl practically relevant to ECMA-402?
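The distinction drawn here, between matching the entire needle against an entire list item and substring search, can be sketched in Python. This is only an illustration: `match_key` is a hypothetical stand-in for collator equality (NFD plus case folding), not what ECMA-402 or any search tailoring actually computes.

```python
import unicodedata


def match_key(text: str) -> str:
    # Stand-in for "compares as equal" (an assumption for illustration).
    return unicodedata.normalize("NFD", text).casefold()


def whole_item_matches(needle: str, items: list[str]) -> list[str]:
    """Keep items where the entire needle equals the entire item."""
    key = match_key(needle)
    return [item for item in items if match_key(item) == key]


def substring_matches(needle: str, items: list[str]) -> list[str]:
    """Keep items that contain the needle anywhere (substring search)."""
    key = match_key(needle)
    return [item for item in items if key in match_key(item)]


items = ["Seoul", "Seoul Station", "seoul"]
print(whole_item_matches("seoul", items))  # ['Seoul', 'seoul']
print(substring_matches("seoul", items))   # ['Seoul', 'Seoul Station', 'seoul']
```

Only the first style of matching is available when the sole primitive is a comparison of two complete strings.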

--
Henri Sivonen
hsiv...@mozilla.com