Filtered Transliterations in ICU4C, difference between documentation and behaviour

Jan Krämer

Aug 7, 2025, 1:01:56 PM
to icu-support
Starting from the "PDF ligature copy-paste" problem, I got sucked into playing with ICU's general transforms and am now writing a short notebook of fun things you can do with them. While trying to explain global and local filters, I created three deliberately nonsensical compound IDs:

1. "[[:Uppercase_Letter:]] any-remove; any-upper"
2. "any-null; [[:Uppercase_Letter:]] any-remove; any-upper"
3. "[[:Uppercase_Letter:]]; any-remove; any-upper"

From my reading of the documentation [1, 2], I expected IDs 1 and 2 to be equivalent. Instead, my test showed that IDs 1 and 3 produce the same output:

Input: "This is a test. 123ABC!"
1. "his is a test. 123!"
2. "HIS IS A TEST. 123!"
3. "his is a test. 123!"

I believe the output of ID 2 is the (intended) output of compound ID 1, i.e. the filter in ID 1 should apply only to any-remove. Is my reading correct?
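
To make the two readings concrete, here is a minimal stand-alone C++ sketch. It does not call ICU at all: it is a toy model over ASCII text, and the names `applyFiltered`, `localFilterReading` and `globalFilterReading` are hypothetical helpers invented for this illustration. It assumes that a filtered transform touches only runs of characters accepted by the filter and leaves everything else unchanged.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>

// Toy model only: a "step" rewrites a string, a "filter" accepts single bytes.
using Step = std::function<std::string(const std::string&)>;
using Filter = std::function<bool(unsigned char)>;

// Apply `step` to each maximal run of characters accepted by `filter`;
// characters rejected by the filter are copied through unchanged.
std::string applyFiltered(const std::string& in, const Filter& filter, const Step& step) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (!filter(static_cast<unsigned char>(in[i]))) {
            out += in[i++];
            continue;
        }
        std::size_t j = i;
        while (j < in.size() && filter(static_cast<unsigned char>(in[j]))) ++j;
        out += step(in.substr(i, j - i));
        i = j;
    }
    return out;
}

// Stand-in for any-remove: deletes whatever it is given.
std::string removeAll(const std::string&) { return std::string(); }

// Stand-in for any-upper, ASCII only.
std::string toUpper(const std::string& s) {
    std::string r = s;
    for (char& c : r) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return r;
}

// Stand-in for [[:Uppercase_Letter:]], ASCII only.
bool isUpperAscii(unsigned char c) { return std::isupper(c) != 0; }

// Reading A: the filter is local to the remove step; upper then sees all text.
std::string localFilterReading(const std::string& in) {
    return toUpper(applyFiltered(in, isUpperAscii, removeAll));
}

// Reading B: the filter is global, so the whole chain only sees uppercase runs.
std::string globalFilterReading(const std::string& in) {
    return applyFiltered(in, isUpperAscii,
                         [](const std::string& run) { return toUpper(removeAll(run)); });
}
```

With the input "This is a test. 123ABC!", `localFilterReading` returns "HIS IS A TEST. 123!" (the output observed for ID 2) and `globalFilterReading` returns "his is a test. 123!" (the output observed for IDs 1 and 3). This only models the two interpretations; it says nothing about which one ICU is meant to implement.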

From my (rather rushed and absolutely uninformed) look into the codebase, my intuition is that inside TransliteratorIDParser::parseCompoundID [3], the call to TransliteratorIDParser::parseGlobalFilter [4] succeeds even if the filter is actually local. Or is the code working as intended, and the documentation is missing something?

Please excuse the lack of a pull request; I have not written any C++ in a long time, and ICU as a codebase is a bit overwhelming...
(Also: Atlassian, rather rudely, was unable to log me in correctly on the unicode-org.atlassian.net domain, so I could not file an issue there...)

Have a nice day

Rich Gillam

Aug 7, 2025, 6:36:20 PM
to Jan Krämer, icu-support
This isn’t my area of expertise, but I read the documentation the same way you do. I’d recommend filing a ticket.

—Rich Gillam
   ICU TC vice-chair
