ICU API proposal: upcoming changes to corner cases of UnicodeSet behaviour

Robin Leroy

Dec 19, 2025, 1:24:45 PM

to icu-d...@unicode.org

Dear ICU Team & Users,

[Note: If you are on the ICU-TC mailing list, you have already read this proposal.]

Since April this year, work has been going on to prepare a new Unicode Technical Standard #61, “Unicode Set Notation”, to rigorously specify UnicodeSet syntax and ensure that it corresponds to what is implemented.

To align the proposed draft standard with the ICU implementation, the draft standard includes changes that I proposed to the existing behaviour of UnicodeSet in ICU4C and ICU4J.

The following changes have already been accepted by the Technical Committee at its 2025-12-11 meeting. Terms like “property-query” and “NamedSingleton” refer to elements of the grammar in draft UTS #61.

Force variables to be grammatical instead of arbitrary text replacement.
This only affects users who supply a custom SymbolTable when parsing UnicodeSet pattern strings.
Currently, variables $like_this are implemented by text substitution in the UnicodeSet expression (with some strange edge cases: a property-query cannot span a variable boundary, and in fact a variable cannot even start with a property-query, so that with $letters=[:L:], [$letters-[z]] is an error).
This means that it is impossible to parse a UnicodeSet expression without knowing the expansion of the variables, which is a problem for implementers, who often want to pre-parse set-valued variables. It also allows for misleading usage: for instance, with $x=a, $y=-, $z=z, [$x$y$z] is the 26-element set [a-z], whereas [$x$z$y] is the 3-element set [az-].
The Technical Committee decided to require that variables correspond to characters, strings, or sets. $a-$b would therefore be a range for two characters, a set difference for two sets, or ill-formed with any other combination of types.
Note that ICU4X needed to pre-parse its UnicodeSets in transliterator rules, and so has implemented this already; we therefore know that CLDR transliterators are compatible with this change.
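
To make the typed interpretation of $a-$b concrete, here is a minimal C++ sketch. It is not ICU code: CodePointSet, Value, makeRange and evaluateMinus are hypothetical names chosen for illustration, and rejecting an inverted range is an assumption rather than something stated above.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <variant>

// Hypothetical model for illustration only; none of these names are ICU API.
using CodePointSet = std::set<char32_t>;
// A variable's value is a character, a string, or a set.
using Value = std::variant<char32_t, std::u32string, CodePointSet>;

// Builds the inclusive range first..last.
CodePointSet makeRange(char32_t first, char32_t last) {
    assert(first <= last);  // callers check this before building a range
    CodePointSet range;
    for (char32_t c = first;; ++c) {
        range.insert(c);
        if (c == last) break;
    }
    return range;
}

// Interprets `$a-$b` from the types of the two values alone.
// Returns std::nullopt when the combination is ill-formed.
std::optional<CodePointSet> evaluateMinus(const Value& a, const Value& b) {
    const auto* charA = std::get_if<char32_t>(&a);
    const auto* charB = std::get_if<char32_t>(&b);
    if (charA && charB) {
        // Two characters: a range.
        if (*charA > *charB) return std::nullopt;  // assumption: inverted ranges are rejected
        return makeRange(*charA, *charB);
    }
    const auto* setA = std::get_if<CodePointSet>(&a);
    const auto* setB = std::get_if<CodePointSet>(&b);
    if (setA && setB) {
        // Two sets: a set difference.
        CodePointSet difference;
        for (char32_t c : *setA) {
            if (setB->count(c) == 0) difference.insert(c);
        }
        return difference;
    }
    // Any other combination of types (character with set, string operands, ...) is ill-formed.
    return std::nullopt;
}
```

The point of the sketch is that the meaning of the expression depends only on the types of the variables, so the surrounding pattern can be parsed before their values are known.
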
Stop using lookupMatcher in UnicodeSet parsing.
This only affects users who supply a custom SymbolTable when parsing UnicodeSet pattern strings.
UnicodeSet currently has a (poorly-documented) alternative way of using variables, by implementing lookupMatcher in the SymbolTable. Characters can be mapped to UnicodeSets.
This is not used directly, but is instead used by the ICU transliterator and RBBI implementations to get pre-parsed set-valued variables: Transliterator rule variables map to a PUA range (U+E000..U+F8FF), which in turn maps to sets via lookupMatcher. (RBBI uses a single noncharacter instead and relies on the order of calls.)
This is obscure, brittle, and complicates parsing; what it achieves is made redundant by the decision to make variables grammatical above (if a variable can directly represent a set, there is no need to jump through a PUA code point to do so).
In addition, syntax characters can be remapped by this mechanism, to varying effect; this further complicates parsing and can only lead to absurd behaviour.
The Technical Committee decided to get rid of this mechanism entirely: UnicodeSet parsing will no longer call SymbolTable.lookupMatcher().
Note that this will entail changes to RBBI and transliterator rule parsing, so that they have some other way of using their pre-parsed sets.
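
The indirection being removed can be sketched as follows. This is a simplified model, not the ICU implementation: StandInTable and DirectTable are hypothetical, and the lookupMatcher member here only borrows the name of the ICU method, not its real signature.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical model for illustration only; not the ICU SymbolTable interface.
using CodePointSet = std::set<char32_t>;

// Indirect scheme: each set-valued variable is replaced by a stand-in
// character from the PUA, and the stand-in is mapped to the pre-parsed set.
struct StandInTable {
    static constexpr char32_t kFirst = 0xE000;
    static constexpr char32_t kLast = 0xF8FF;

    std::map<std::string, char32_t> variableToStandIn;
    std::map<char32_t, CodePointSet> standInToSet;
    char32_t next = kFirst;

    char32_t define(const std::string& name, CodePointSet set) {
        assert(next <= kLast);  // the stand-in range is finite
        const char32_t standIn = next++;
        variableToStandIn[name] = standIn;
        standInToSet[standIn] = std::move(set);
        return standIn;
    }

    // Second hop: character -> set (modelled on the mechanism described above).
    const CodePointSet* lookupMatcher(char32_t c) const {
        const auto it = standInToSet.find(c);
        return it == standInToSet.end() ? nullptr : &it->second;
    }

    // Resolving a variable takes two hops: name -> stand-in -> set.
    const CodePointSet* resolve(const std::string& name) const {
        const auto it = variableToStandIn.find(name);
        return it == variableToStandIn.end() ? nullptr : lookupMatcher(it->second);
    }
};

// Direct scheme: once variables are grammatical, a variable can simply be a set.
struct DirectTable {
    std::map<std::string, CodePointSet> variables;

    const CodePointSet* resolve(const std::string& name) const {
        const auto it = variables.find(name);
        return it == variables.end() ? nullptr : &it->second;
    }
};
```

Both tables resolve a variable name to the same set; the direct one does so without reserving any code points as stand-ins.
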
Treat \N as a character, not a set, with no exception for backward compatibility.
Currently [\N{LATIN SMALL LETTER A}-\N{LATIN SMALL LETTER Z}] is the 1-element set [a]. This is highly misleading (and I have been bitten by it). The Proposed Draft UTS #61, Revision 1, draft 2 (archival version: L2/25-265) proposes allowing \N (named-element) where a character is (in RangeElement), but also in some places where a set is (as a NamedSingleton), for backward compatibility with users who might have written \N{LATIN SMALL LETTER A} for [a], or [[a-z]-\N{LATIN SMALL LETTER A}] for [b-z].
The Technical Committee discussed this at length, and eventually decided to allow named-element as a character (in the RangeElement production), but not to allow named-element to occur as a set. That is, the TC rejected the changes highlighted in cyan in UTS #61, Revision 1, draft 2.
The reasons are twofold:
1. This complicates the grammar and thus the understandability of UnicodeSet syntax going forward, for a relatively small advantage in backward compatibility: \N is not very widely used, and adding [] in a few places is a small migration cost.
2. C++23 allows for \N escapes in strings. Allowing \N to stand for a set would mean that, while UnicodeSet("[\\N{LATIN SMALL LETTER A}-\\N{LATIN SMALL LETTER Z}]") and UnicodeSet("[\N{LATIN SMALL LETTER A}-\N{LATIN SMALL LETTER Z}]") would be equivalent, UnicodeSet("\\N{LATIN SMALL LETTER C}") would be OK, but UnicodeSet("\N{LATIN SMALL LETTER C}") would be an error; or that UnicodeSet(R"([[a-z]-\N{LATIN SMALL LETTER C}])") would be OK, but that UnicodeSet("[[a-z]-\N{LATIN SMALL LETTER C}]") would be an error.
Since UnicodeSet syntax is otherwise largely consistent with C++ escape sequences, the confusion this would induce was deemed to outweigh the backward incompatibility.
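
The difference between an escape processed by the compiler and one left for the UnicodeSet parser can be seen with \u escapes, which do not require C++23. The following is only a sketch of the string contents: the constant names are hypothetical, and \u stands in for \N.

```cpp
#include <string>

// Escape left for the UnicodeSet parser: the compiler only turns "\\" into "\",
// so the parser receives the six characters \u03B1 and resolves them itself.
inline const std::u32string kParserEscaped = U"[\\u03B1-\\u03C9]";

// Raw string literal: no escape processing by the compiler, same contents as above.
inline const std::u32string kRawEscaped = UR"([\u03B1-\u03C9])";

// Escape processed by the compiler: the parser receives the characters
// U+03B1 and U+03C9 directly and never sees a backslash.
inline const std::u32string kCompilerEscaped = U"[\u03B1-\u03C9]";
```

With \u, either form reaches the parser as a range between the same two characters. The decision above keeps \N in line with that: it denotes a character whichever side resolves it.
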

Best regards,

Robin Leroy

Markus Scherer

Dec 19, 2025, 2:23:43 PM

to Robin Leroy, icu-d...@unicode.org

lgtm tnx!

markus
