Uppercasing with U+0345 in non-Greek locales

Jeff Davis

Jun 12, 2025, 7:22:24 PM

to icu-support

SpecialCasing.txt, under "Unconditional Mappings", says:

# IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
# the result will be incorrect unless the iota-subscript is moved to the end
# of any sequence of combining marks. Otherwise, the accents will go on the capital iota.
# This process can be achieved by first transforming the text to NFC before casing.
# E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>

But it appears that ICU only follows that rule when uppercasing in a Greek locale. Is that right?

I am seeing, with u_strToUpper(03B1 0301 0345):

el-GR: 0391 0399

en-US: 0391 0301 0399

and with u_strToUpper(03B1 0345 0301):

el-GR: 0391 0399

en-US: 0391 0399 0301

There seems to be some specific tailoring for the Greek result as well, which makes it harder to see exactly what's going on. Regardless, the rule about moving 0345 to the end doesn't seem to be applied at all for non-Greek locales.
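The non-Greek results above can be reproduced without ICU. The following is a minimal sketch that assumes Python's built-in `str.upper()`, which applies the locale-independent Unicode mappings, as a stand-in for `u_strToUpper` in a non-Greek locale. The helper name `hex_upper` is hypothetical.

```python
def hex_upper(code_points: str) -> str:
    """Uppercase a space-separated list of hex code points and return
    the result in the same notation."""
    text = "".join(chr(int(cp, 16)) for cp in code_points.split())
    # str.upper() maps each code point in place; it does not reorder them.
    return " ".join(f"{ord(ch):04X}" for ch in text.upper())


print(hex_upper("03B1 0301 0345"))  # 0391 0301 0399
print(hex_upper("03B1 0345 0301"))  # 0391 0399 0301
```

In both cases U+0345 becomes U+0399 at the position where it already was, which matches the en-US output reported above.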

While the rule to move 0345 to the end is specific to Greek, it doesn't interfere with other language rules, so it seems reasonable to apply it in any locale. Is there a reason it's only applied in a Greek locale?

Regards,

Jeff Davis

Markus Scherer

Jun 12, 2025, 9:04:57 PM

to Jeff Davis, icu-support

On Thu, Jun 12, 2025 at 4:22 PM Jeff Davis <pg...@j-davis.com> wrote:

> SpecialCasing.txt, under "Unconditional Mappings", says:
>
> # IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
> # the result will be incorrect unless the iota-subscript is moved to the end
> # of any sequence of combining marks. Otherwise, the accents will go on the capital iota.
> # This process can be achieved by first transforming the text to NFC before casing.
> # E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>
>
> But it appears that ICU only follows that rule when uppercasing in a Greek locale. Is that right?

Yes. Normally, the uppercase functions do what the spec says:

https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G34078

R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).

So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.

I don't actually think that NFC will do the trick; it might even make things worse, because it pulls the iota subscript into a composite letter despite a following lower-ccc combining mark.

NFD would work, or (with ICU), FCD.
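To illustrate the difference, here is a sketch that uses Python's `unicodedata` module rather than ICU (an assumption made for brevity; FCD is ICU-specific and is not shown). The example string, alpha + iota subscript + combining diaeresis (U+0308, ccc 230), is chosen for illustration because alpha has no precomposed form with a diaeresis. The helper names are hypothetical.

```python
import unicodedata


def upper_after(form: str, text: str) -> str:
    """Normalize text to the given form, then uppercase it."""
    return unicodedata.normalize(form, text).upper()


def hexes(text: str) -> str:
    return " ".join(f"{ord(ch):04X}" for ch in text)


text = "\u03b1\u0345\u0308"  # alpha, iota subscript (ccc 240), diaeresis (ccc 230)

# NFD reorders by combining class, so the diaeresis moves before U+0345.
print(hexes(unicodedata.normalize("NFD", text)))  # 03B1 0308 0345
print(hexes(upper_after("NFD", text)))            # 0391 0308 0399

# NFC composes alpha + iota subscript into U+1FB3 and leaves the diaeresis after it.
print(hexes(unicodedata.normalize("NFC", text)))  # 1FB3 0308
print(hexes(upper_after("NFC", text)))            # 0391 0399 0308
```

With NFD the capital iota ends up last; with NFC the diaeresis follows the capital iota, which is the outcome the SpecialCasing.txt comment warns about.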

You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

> There seems to be some specific tailoring for the Greek result as well

Yes. Greek uppercasing is very special; it drops most of the combining marks: ICU-5456

Best regards,

markus

Mark Davis Ⓤ

Jun 12, 2025, 9:09:37 PM

to Markus Scherer, Jeff Davis, icu-support

And I believe one has to distinguish Greek titlecasing accent behavior: it is different from uppercasing the first grapheme cluster.

Jeff Davis

Jun 13, 2025, 11:45:41 AM

to icu-support, Markus Scherer, icu-support, Jeff Davis

On Thursday, June 12, 2025 at 6:04:57 PM UTC-7 Markus Scherer wrote:

> Yes. Normally, the uppercase functions do what the spec says:
> https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G34078
> R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).

That makes sense. My understanding of the word "mapping" doesn't allow for moving codepoints around.

> So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.

You mean it's essentially a warning that you should normalize before uppercasing? And they put it in "Unconditional mappings" because that's always an OK thing to do?

> I don't actually think that NFC will do the trick, and might make it worse, because it pulls the iota subscript into a composite letter despite a following lower-ccc combining mark.
> NFD would work, or (with ICU), FCD.

Hmm. I thought that NFC(NFD(X)) = NFC(X). Is that not the case?

> You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

Regards,
Jeff Davis

Jeff Davis

Jun 13, 2025, 7:30:13 PM

to icu-support, Jeff Davis, Markus Scherer, icu-support

On Friday, June 13, 2025 at 8:45:41 AM UTC-7 Jeff Davis wrote:

> I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

I received a reply and it appears to have been assigned #276812. Though they seemed to think that it may be an ICU bug instead -- I assume that will be sorted out.

Markus Scherer

Jun 16, 2025, 9:21:12 PM

to Jeff Davis, icu-support

On Fri, Jun 13, 2025 at 8:45 AM Jeff Davis <pg...@j-davis.com> wrote:

>> So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.

> You mean it's essentially a warning that you should normalize before uppercasing? And they put it in "Unconditional mappings" because that's always an OK thing to do?

It's related to the unconditional mappings. They are unconditional -- in terms of this file -- because they are not language-specific and not context-sensitive.

I read it as a recommendation, not a warning. This is basically one character in Unicode -- plus a few dozen which have it in their Decomposition_Mapping -- for which one would want to do some normalization-related processing before the regular uppercase mapping.
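The set of affected characters can be listed from the Unicode Character Database. The sketch below assumes Python's `unicodedata.decomposition()` as the source of each character's Decomposition_Mapping and skips tagged (compatibility) mappings; the function name is hypothetical, and the exact count depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def chars_with_iota_subscript_in_decomposition() -> list[str]:
    """Return characters whose canonical Decomposition_Mapping
    contains U+0345 (COMBINING GREEK YPOGEGRAMMENI)."""
    found = []
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Skip characters without a mapping and compatibility mappings,
        # which start with a tag such as "<compat>".
        if not fields or fields[0].startswith("<"):
            continue
        if "0345" in fields:
            found.append(chr(cp))
    return found


chars = chars_with_iota_subscript_in_decomposition()
print(len(chars))
print(" ".join(f"{ord(ch):04X}" for ch in chars[:8]))
```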

> Hmm. I thought that NFC(NFD(X)) = NFC(X). Is that not the case?

It is, but NFC was designed to favor compact representation over ease of processing, so NFC text does not always fit the FCD test for canonical order.
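Both halves of that answer can be checked on a small example. This is a sketch using Python's `unicodedata`, not ICU; `is_fcd` is a simplified, hypothetical stand-in for ICU's FCD check, and the sample string (alpha, iota subscript, combining diaeresis) is chosen only for illustration.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def is_fcd(s: str) -> bool:
    """Simplified FCD check (illustrative, not ICU's implementation):
    decompose each character on its own, without reordering, and see
    whether the result is already in canonical order."""
    raw = "".join(nfd(ch) for ch in s)
    return raw == nfd(s)


x = "\u03b1\u0345\u0308"

print(nfc(nfd(x)) == nfc(x))  # True: the identity holds
print(is_fcd(nfd(x)))         # True: NFD text is in canonical order
print(is_fcd(nfc(x)))         # False: U+1FB3 hides a ccc-240 mark before U+0308
```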

>> You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

> I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

It will be fine. It will be routed to the Properties and Algorithms group of which I am the chair :-)