Applying tailoring

Todd Lang

Sep 18, 2025, 9:32:53 AM (12 days ago) Sep 18
to icu-support
I'm attempting to get a collator that gives me a case-insensitive, accent-sensitive locale, but with a twist - whitespace sensitivity.

I thought maybe I could do this via a `ucol_open` on the base locale I'm interested in, "en-u-ks-level2", and then ask for whatever tailoring it may have via `ucol_getRules`.  Then I append the tailoring I'm after to the returned rules, `& \u0020 < \u00a0` (for example), and re-open the collator with `ucol_openRules`, but this gives me a collator with only the tailoring rule I've supplied against the default collation.

Is there a way to "stack" the tailoring against a specified locale?

Alternatively, please tell me I'm holding it wrong and there's a much smarter way to go about what I'm doing.  :|

Markus Scherer

Sep 18, 2025, 1:06:02 PM (12 days ago) Sep 18
to Todd Lang, icu-support
Hi Todd,

On Thu, Sep 18, 2025 at 6:32 AM Todd Lang <todd...@kiyote.ca> wrote:
I'm attempting to get a collator that gives me a case-insensitive, accent-sensitive locale, but with a twist - whitespace sensitivity.

-u-ks-level2 should be all you need.
Spaces have primary weights. See https://www.unicode.org/charts/collation/ and click "Whitespace" on the left.

The collator only ignores spaces if you also turn on alternate=shifted (-ka-shifted).

I thought maybe I could do this via a `ucol_open` on the base locale I'm interested in, "en-u-ks-level2", and then ask for whatever tailoring it may have via `ucol_getRules`.  Then I append the tailoring I'm after to the returned rules, `& \u0020 < \u00a0` (for example), and re-open the collator with `ucol_openRules`, but this gives me a collator with only the tailoring rule I've supplied against the default collation.

getRules() returns the tailoring rules that build on top of the root sort order. English has no tailoring (same sort order as CLDR root), so its rule string is empty.

Hope this helps,
markus

Todd Lang

Sep 18, 2025, 1:19:52 PM (12 days ago) Sep 18
to icu-support, Markus Scherer, icu-support, Todd Lang
Hi Markus,

First, thank you for responding.  I very much appreciate you taking the time out of your day to respond to my email.

I will admit, though, that " See https://www.unicode.org/charts/collation/ and click "Whitespace" on the left." is a little confusing to me.  I'm not exactly an expert and that chart has left me scratching my head.  :D

Just to confirm - a -u-ks-level2 collation will treat a space and a non-breaking space as different?  My experience hasn't quite lined up with that, but again - I could be holding it wrong.  It's all likely user-error and I apologize for that.

Markus Scherer

Sep 19, 2025, 12:42:19 PM (11 days ago) Sep 19
to Todd Lang, icu-support
On Thu, Sep 18, 2025 at 10:19 AM Todd Lang <todd...@kiyote.ca> wrote:
I will admit, though, that " See https://www.unicode.org/charts/collation/ and click "Whitespace" on the left." is a little confusing to me.  I'm not exactly an expert and that chart has left me scratching my head.  :D

There is a "help" page link... --> https://www.unicode.org/charts/collation/help.html

Basically, on each chart, the characters in the left vertical column have primary differences. To the right of each character are others that are only secondary-different or even less.

If you hover on a table cell, the flyover text shows you the [primary | secondary | tertiary] collation weights.

Just to confirm - a -u-ks-level2 collation will treat a space and a non-breaking space as different?

No.
You were asking about "whitespace sensitivity". Whitespace characters do have primary weights, so they are not ignored on any level, which is what "insensitive" usually means here.

However, by using secondary strength, you ignore all tertiary and lower differences. Tertiary includes case, but also differences of compatibility variants and other minor differences.

The difference between U+0020 space and U+00A0 nbsp is on tertiary level.
Your additional tailoring changes this to a primary difference; a primary or secondary difference distinguishes these characters under -ks-level2.

Why do you want to ignore all minor differences except for these?

Best regards,
markus

Todd Lang

Sep 19, 2025, 12:57:59 PM (11 days ago) Sep 19
to icu-support, Markus Scherer, icu-support, Todd Lang
Alright, I apologize for a confusing mess of requests here. 
What I'm specifically looking for is a collator that is case-insensitive, but accent-sensitive and whitespace-sensitive.  That is, two strings like `test string` and `Test(nbsp)String` would end up being not equal, but `test string` and `Test String` would be equal.  I'm building a small harness app to test out various permutations.  This is part of a larger project, so it's been more challenging for me to pick this up and isolate the various interactions, so I really hope I haven't been wasting your time.  Once I get this smaller app coded up I can speak more coherently about what I'm seeing.

Thank you for your guide on how to interpret that chart; it's nice to know my rough reading of it was in line with how it was meant to be interpreted, btw.

Todd Lang

Sep 22, 2025, 1:17:21 PM (8 days ago) Sep 22
to icu-support, Todd Lang, Markus Scherer, icu-support
So, I've created a small app to test the cases I'm interested in. I guess I'm trying to explain this in a more complicated way than I need.
What I'm trying to do is create a collation where everything is distinct _except_ case. 

Is such a thing possible?

Markus Scherer

Sep 22, 2025, 2:53:46 PM (8 days ago) Sep 22
to Todd Lang, icu-support
On Mon, Sep 22, 2025 at 10:17 AM Todd Lang <todd...@kiyote.ca> wrote:
What I'm trying to do is create a collation where everything is distinct _except_ case. 
Is such a thing possible?

You want to pick an arbitrary collator / sort order, and keep every tertiary and quaternary distinction that it makes, except for ones made only because of case distinctions?
That sounds like a research project which would add a couple of thousand overrides for every collator, and you would incur the time and heap memory for building collators at runtime unless you add the mappings into your ICU collation tailoring data files and rebuild.

This use case is not, and cannot be, covered by parametric settings.

markus

Todd Lang

Sep 22, 2025, 2:55:38 PM (8 days ago) Sep 22
to icu-support, Markus Scherer, icu-support, Todd Lang
Yeah, I was afraid of that.  Obtaining collation parity with SQL Server is proving to be incredibly problematic.  The downsides of trying to emulate something "doing it wrong" with a library that "does it right".

David Rowe

Sep 22, 2025, 9:17:49 PM (7 days ago) Sep 22
to icu-s...@unicode.org
Have you tried using https://icu4c-demos.unicode.org/icu-bin/collation.html ? You could put your test cases in the "Input" box then try adding collation information in the "Append rules" box and/or select options under "Settings" until the "sort" button sorts the list correctly.

In fact, if you put the lines in the "Input" box in the order you want them sorted, the "Output" box will display "The input is already in sorted order." when your sort options and tailoring are correctly set to sort the input lines into that order.
--
You received this message because you are subscribed to the Google Groups "icu-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-support...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-support/b06ca0a8-3a75-4138-8e3d-80f902eb3c0en%40unicode.org.

Todd Lang

Sep 23, 2025, 2:15:11 PM (7 days ago) Sep 23
to icu-support, David Rowe
Hi David,

I appreciate that link.  I've played around with it, and it confirms (to no one's surprise) what Markus has pointed out.  I am trying to create a difference at a level that is explicitly ignored, it seems.  As soon as you are no longer interested in case, the differences in whitespace also seem to fall away.

Markus Scherer

Sep 23, 2025, 2:19:18 PM (7 days ago) Sep 23
to Todd Lang, icu-support, David Rowe
On Tue, Sep 23, 2025 at 11:15 AM Todd Lang <todd...@kiyote.ca> wrote:
... it confirms (to no one's surprise) what Markus has pointed out.  I am trying to create a difference at a level that is explicitly ignored, it seems.  As soon as you are no longer interested in case, the differences in whitespace also seem to fall away.

Right. To do what you want, you would need to use tertiary or better strength, and tailor any character that sorts tertiary-after its (usually lowercase) peer to instead sort the same. For example, "&a=A". Across Unicode, that would be something like a couple of thousand such rules, and you would want to adjust it for each new Unicode version.

markus

Rich Gillam

Sep 23, 2025, 3:18:47 PM (7 days ago) Sep 23
to Markus Scherer, Todd Lang, icu-support, David Rowe
You guys probably already discussed this and eliminated it, but this seems like a situation where the simplest solution would be to preprocess the string— apply a case-folding transformation to the string first, and then compare the resulting case-folded strings with tertiary strength.

—Rich Gillam

Markus Scherer

Sep 24, 2025, 6:19:56 PM (5 days ago) Sep 24
to Todd Lang, Rich Gillam, icu-support, David Rowe
On Wed, Sep 24, 2025 at 6:03 AM Todd Lang <todd...@kiyote.ca> wrote:
I can't seem to make that work properly with ucol_openRules, though?  Like, how does one apply this in a rule?  I had thought it might be "[strength 2] & <tailoring rules here>" but it seems to not shift to something like a "ks-level2" locale collation.

I just tried this:
        UErrorCode errorCode = U_ZERO_ERROR;
        UCollator *coll = ucol_openRules(u"[strength 2]&a=b", -1, UCOL_DEFAULT, UCOL_DEFAULT, NULL, &errorCode);
        UCollationResult result = ucol_strcoll(coll, u"a", -1, u"A", -1);
        printf("a vs. A: %d\n", result);
        ucol_close(coll);

It prints, as expected: (0 = UCOL_EQUAL = case-insensitive)
a vs. A: 0

When I change the code to say "[strength 3]" I get: (-1 = UCOL_LESS)
a vs. A: -1

Seems to work...?

Did you pass UCOL_DEFAULT into the UCollationStrength argument of ucol_openRules() (fourth argument)?
If not, then that overrides what the rules say.

markus

Markus Scherer

Sep 24, 2025, 6:38:10 PM (5 days ago) Sep 24
to Todd Lang, Rich Gillam, icu-support, David Rowe
On Wed, Sep 24, 2025 at 3:19 PM Markus Scherer <marku...@gmail.com> wrote:
Did you pass UCOL_DEFAULT into the UCollationStrength argument of ucol_openRules() (fourth argument)?
If not, then that overrides what the rules say.

Todd Lang

Sep 25, 2025, 9:02:20 PM (4 days ago) Sep 25
to Rich Gillam, Markus Scherer, icu-support, David Rowe
I can't seem to make that work properly with ucol_openRules, though?  Like, how does one apply this in a rule?  I had thought it might be "[strength 2] & <tailoring rules here>" but it seems to not shift to something like a "ks-level2" locale collation.