Uppercasing with U+0345 in non-Greek locales


Jeff Davis

Jun 12, 2025, 7:22:24 PM
to icu-support
SpecialCasing.txt, under "Unconditional Mappings", says:

# IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
#  the result will be incorrect unless the iota-subscript is moved to the end
#  of any sequence of combining marks. Otherwise, the accents will go on the capital iota.
#  This process can be achieved by first transforming the text to NFC before casing.
#  E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>

But it appears that ICU only follows that rule when uppercasing in a Greek locale. Is that right?

I am seeing, with u_strToUpper(03B1 0301 0345):
   el-GR: 0391 0399
   en-US: 0391 0301 0399

and with u_strToUpper(03B1 0345 0301):
   el-GR: 0391 0399
   en-US: 0391 0399 0301

There seems to be some specific tailoring for the Greek result as well, which makes it harder to see exactly what's going on. Regardless, the rule about moving 0345 to the end doesn't seem to be applied at all for non-Greek locales.

While the rule to move 0345 to the end is specific to Greek, it doesn't interfere with other language rules, so it seems reasonable to apply it in any locale. Is there a reason it's only applied in a Greek locale?

Regards,
    Jeff Davis

Markus Scherer

Jun 12, 2025, 9:04:57 PM
to Jeff Davis, icu-support
On Thu, Jun 12, 2025 at 4:22 PM Jeff Davis <pg...@j-davis.com> wrote:
> SpecialCasing.txt, under "Unconditional Mappings", says:
>
> # IMPORTANT-when iota-subscript (0345) is uppercased or titlecased,
> #  the result will be incorrect unless the iota-subscript is moved to the end
> #  of any sequence of combining marks. Otherwise, the accents will go on the capital iota.
> #  This process can be achieved by first transforming the text to NFC before casing.
> #  E.g. <alpha><iota_subscript><acute> is uppercased to <ALPHA><acute><IOTA>
>
> But it appears that ICU only follows that rule when uppercasing in a Greek locale. Is that right?

Yes. Normally, the uppercase functions do what the spec says:
R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).
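
A rough illustration of R1, using Python's built-in `str.upper()` on single characters as a stand-in for an untailored (non-Greek) Uppercase_Mapping. This is an assumption made for the sake of a runnable sketch, not ICU's `u_strToUpper`, and the helper names are hypothetical. Each code point is mapped where it stands and nothing is reordered, which matches the en-US results reported in the original question.

```python
def to_uppercase_r1(text: str) -> str:
    """R1 sketch: map each character C to Uppercase_Mapping(C), in order."""
    return "".join(ch.upper() for ch in text)


def hexes(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# alpha, acute, ypogegrammeni: the iota subscript is already last
print(hexes(to_uppercase_r1("\u03b1\u0301\u0345")))  # 0391 0301 0399

# alpha, ypogegrammeni, acute: the acute ends up after the capital iota
print(hexes(to_uppercase_r1("\u03b1\u0345\u0301")))  # 0391 0399 0301
```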

So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.
I don't actually think that NFC will do the trick, and it might make things worse, because it pulls the iota subscript into a composite letter despite a following lower-ccc combining mark.
NFD would work, or (with ICU), FCD.
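
A minimal sketch of the difference, again assuming Python's `unicodedata.normalize` and `str.upper()` as stand-ins for ICU normalization and untailored uppercasing. The dot-below input (U+0323, ccc 220) is chosen here only for illustration; it is a mark with a lower combining class than U+0345 (ccc 240) that does not compose with alpha. The helper names are hypothetical.

```python
import unicodedata


def hexes(text: str) -> str:
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def upper_after(form: str, text: str) -> str:
    """Normalize to the given form, then apply untailored uppercasing."""
    return unicodedata.normalize(form, text).upper()


acute = "\u03b1\u0345\u0301"      # alpha, ypogegrammeni, acute (ccc 230)
dot_below = "\u03b1\u0345\u0323"  # alpha, ypogegrammeni, dot below (ccc 220)

# NFD reorders the marks so that U+0345 comes last before it is uppercased.
print(hexes(upper_after("NFD", acute)))      # 0391 0301 0399
print(hexes(upper_after("NFD", dot_below)))  # 0391 0323 0399

# NFC happens to work for the acute, because everything composes to U+1FB4.
print(hexes(upper_after("NFC", acute)))      # 0386 0399

# NFC composes alpha + U+0345 into U+1FB3 and leaves the dot below behind it,
# so after uppercasing the dot follows the capital iota.
print(hexes(upper_after("NFC", dot_below)))  # 0391 0399 0323
```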

You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

> There seems to be some specific tailoring for the Greek result as well

Yes. Greek uppercasing is very special; it drops most of the combining marks: ICU-5456

Best regards,
markus

Mark Davis Ⓤ

Jun 12, 2025, 9:09:37 PM
to Markus Scherer, Jeff Davis, icu-support
And I believe one has to distinguish Greek titlecasing accent behavior: it is different from uppercasing the first grapheme cluster.


Jeff Davis

Jun 13, 2025, 11:45:41 AM
to icu-support, Markus Scherer, icu-support, Jeff Davis
On Thursday, June 12, 2025 at 6:04:57 PM UTC-7 Markus Scherer wrote:
> Yes. Normally, the uppercase functions do what the spec says:
> R1 toUppercase(X): Map each character C in X to Uppercase_Mapping(C).

That makes sense. My understanding of the word "mapping" doesn't allow for moving codepoints around.
 
> So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.

You mean it's essentially a warning that you should normalize before uppercasing? And they put it in "Unconditional mappings" because that's always an OK thing to do?
 
> I don't actually think that NFC will do the trick, and it might make things worse, because it pulls the iota subscript into a composite letter despite a following lower-ccc combining mark.
> NFD would work, or (with ICU), FCD.

Hmm. I thought that NFC(NFD(X)) = NFC(X). Is that not the case?
 
> You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

Regards,
    Jeff Davis

Jeff Davis

Jun 13, 2025, 7:30:13 PM
to icu-support, Jeff Davis, Markus Scherer, icu-support
On Friday, June 13, 2025 at 8:45:41 AM UTC-7 Jeff Davis wrote:
> I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

I received a reply and it appears to have been assigned #276812. Though they seemed to think that it may be an ICU bug instead -- I assume that will be sorted out.

 

Markus Scherer

Jun 16, 2025, 9:21:12 PM
to Jeff Davis, icu-support
On Fri, Jun 13, 2025 at 8:45 AM Jeff Davis <pg...@j-davis.com> wrote: 
>> So in order to get Greek right when there is an implicit or explicit iota subscript (ypogegrammeni) followed by another combining mark, you need to normalize first.

> You mean it's essentially a warning that you should normalize before uppercasing? And they put it in "Unconditional mappings" because that's always an OK thing to do?

It's related to the unconditional mappings. They are unconditional -- in terms of this file -- because they are not language-specific and not context-sensitive.

I read it as a recommendation, not a warning. This is basically one character in Unicode -- plus a few dozen which have it in their Decomposition_Mapping -- for which one would want to do some normalization-related processing before the regular uppercase mapping.
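
One way to see which characters are meant, sketched with Python's `unicodedata` module (an assumption for illustration; the set and its size depend on the Unicode version bundled with the interpreter, and `has_iota_subscript` is a hypothetical helper). It lists the code points whose canonical Decomposition_Mapping contains U+0345 directly.

```python
import unicodedata

YPOGEGRAMMENI = "0345"


def has_iota_subscript(ch: str) -> bool:
    """True if the canonical Decomposition_Mapping of ch contains U+0345."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0].startswith("<"):
        # No mapping, or a compatibility mapping tagged like "<compat>".
        return False
    return YPOGEGRAMMENI in fields


affected = [cp for cp in range(0x110000) if has_iota_subscript(chr(cp))]

print(len(affected), "code points have U+0345 in their canonical decomposition")
print("first few:", ", ".join(f"U+{cp:04X}" for cp in affected[:3]))
```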

> Hmm. I thought that NFC(NFD(X)) = NFC(X). Is that not the case?

It is, but NFC was designed to favor compact representation over ease of processing, so NFC text does not always fit the FCD test for canonical order.
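
Both halves of that answer can be checked with a small sketch. It assumes Python's `unicodedata` in place of ICU, and `is_fcd` is a hypothetical helper that approximates the FCD test: concatenating each character's canonical decomposition, without reordering, must already be in canonical order. The dot-below input is an illustrative choice, not taken from the thread.

```python
import unicodedata


def is_fcd(text: str) -> bool:
    """Approximate FCD check based on lead/trail combining classes."""
    prev_trail = 0
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomposed[0])
        # A non-starter may not follow a mark with a higher combining class.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomposed[-1])
    return True


x = "\u03b1\u0345\u0323"  # alpha, ypogegrammeni (ccc 240), dot below (ccc 220)
nfd = unicodedata.normalize("NFD", x)
nfc = unicodedata.normalize("NFC", x)

print(unicodedata.normalize("NFC", nfd) == nfc)  # True: NFC(NFD(X)) == NFC(X)
print(is_fcd(nfd))  # True: NFD text is canonically ordered
print(is_fcd(nfc))  # False: U+1FB3 ends in ccc 240, then comes ccc 220
```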

>> You might want to submit a bug report about the misleading text in SpecialCasing.txt, via https://www.unicode.org/reporting.html

> I have filed the report as "An Error in Publications/Data", which linked to this discussion, but I did not receive an issue number or confirmation email so I can't link to it from here. Hopefully I have included the right context so they understand the source of the confusion.

It will be fine. It will be routed to the Properties and Algorithms group of which I am the chair :-)
Thanks for reporting.

markus