ICU API proposal: more UnicodeSet parsing changes


Robin Leroy

Dec 30, 2025, 1:12:00 PM
to icu-d...@unicode.org
Dear ICU team & users,

Following up on the changes approved by the TC on 2025-12-11, and continuing the effort to rigorously specify UnicodeSet syntax and have our implementation conform to that specification, I would like to propose the following changes to UnicodeSet behaviour
for: ICU 79
Please provide feedback by: Wednesday, 2026-01-07
Designated API reviewer: Markus

Some of these are breaking changes, although I think they are in such weird, undocumented corners of the syntax that they are unlikely to affect many people; one of them is even the worst kind of breaking change, turning legal UnicodeSet expressions into legal UnicodeSet expressions with a different meaning.
Note that we have already approved such a change on 2025-12-11, with [\N{LATIN SMALL LETTER A}-\N{LATIN SMALL LETTER Z}] (formerly a 1-element set, now a 26-element set).
The expressions whose interpretation was altered by the \N change were egregiously misleading. I think the same applies to the space-insensitive string literals, as did the Properties and Algorithms Group of the UTC, so this breaking change is likewise likely to help users more than it harms them.

The breaking changes are first in the list below, clearly marked (and the legal-to-legal one is first among those), with the list moving on to pure extensions of ICU’s supported UnicodeSet syntax towards the bottom. The final item is not actually about UnicodeSet parsing, but about the output of toPattern.

Best regards,

Robin Leroy

Breaking, legal-to-legal: Space-sensitive string literals

Currently, [{T h i s s e t}] contains the 7-character string “Thisset”: spaces are ignored in string literals. This came as a surprise to both Mark Davis, who designed the support for string literals, and to Markus Scherer, who is probably the longest continuous maintainer of ICU’s UnicodeSet implementation.

The question was brought to the Properties and Algorithms group of the UTC, which was similarly surprised by the idea of a space-insensitive string literal, and pointed out that string literals don’t do that in any other language.

Proposal: Stop ignoring spaces in UnicodeSet string literals. [{T h i s s e t}] will now contain the 13-character string “T h i s s e t”.
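
To make the difference concrete, here is a minimal Python sketch of the two interpretations of the text between the braces. It is a toy model, not ICU code: the function names are hypothetical, escapes are not handled, and Python's str.isspace() only approximates the white space that ICU actually skips.

```python
def string_literal_old(body: str) -> str:
    """Toy model of the current behaviour: unescaped white space is dropped.

    Assumption: str.isspace() stands in for the white space ICU ignores.
    """
    return "".join(ch for ch in body if not ch.isspace())


def string_literal_new(body: str) -> str:
    """Toy model of the proposed behaviour: every character is kept."""
    return body


body = "T h i s s e t"  # the text between { and } in [{T h i s s e t}]
print(len(string_literal_old(body)), repr(string_literal_old(body)))
print(len(string_literal_new(body)), repr(string_literal_new(body)))
```

Under the old rule the set contains one 7-character string; under the proposed rule it contains one 13-character string, so any existing pattern that relies on spaces being dropped changes meaning silently.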

Breaking, legal-to-illegal: No spaces in [:^

Currently,

[:
XID continue:]

and

[:
^XID continue:]

and

[:   ^XID continue:]

are valid, but

[:XID continue
:]

and

[:^
XID continue
:]

are ill-formed.

\p{
XID continue}

is also ill-formed. PD UTS61 treats [:^ as atomic. As in current ICU, spaces are not otherwise allowed within [: :] or \p{ }, except that U+0020 is ignored as part of name alias comparison.

Proposal: Disallow spaces (horizontal or vertical) within [:^; disallow vertical space or tabulations after [:. U+0020 SPACE remains allowed in the likes of [: X ID CONTINUE :] by ICU’s implementation of UAX44-LM3 (actually an earlier version of that rule; we don’t do “is”).
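
The proposed rule can be sketched as a small Python check. This is a toy model under stated assumptions, not ICU code: posix_query_ok is a hypothetical name, the property name is not validated, a caret anywhere other than directly after [: is simply rejected, and str.isspace() approximates "vertical space or tabulations".

```python
def posix_query_ok(pattern: str) -> bool:
    """Toy well-formedness check for [:...:] under the proposed rule.

    Only the white-space handling is modelled:
    - the negation caret, if any, must directly follow "[:";
    - U+0020 SPACE is the only white space allowed in the body.
    """
    if not (pattern.startswith("[:") and pattern.endswith(":]")):
        return False
    body = pattern[2:-2]
    if body.startswith("^"):
        body = body[1:]          # "[:^" treated as one atomic token
    if "^" in body:
        return False             # e.g. "[:   ^XID continue:]"
    if any(ch.isspace() and ch != " " for ch in body):
        return False             # newline, tab, etc. anywhere in the body
    return body.strip(" ") != ""


for p in ["[:XID continue:]", "[: X ID CONTINUE :]", "[:^XID continue:]",
          "[:   ^XID continue:]", "[:\nXID continue:]", "[:^\nXID continue\n:]"]:
    print(repr(p), posix_query_ok(p))
```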

Breaking, legal-to-illegal: No unary property queries with a trailing equals sign

Currently, \p{XID_Continue=} is equivalent to \p{XID_Continue} or \p{XID_Continue=Yes}. Even weirder, \p{Uppercase_Letter=} is equivalent to \p{Uppercase_Letter} or \p{General_Category=Uppercase_Letter}.

Proposal: disallow an equals sign on unary queries.

Breaking, legal-to-illegal: Regularize escapes in strings

In ICU 78, \p, \P, and \N are disallowed in a set unless they are property queries or named characters, but they are allowed in string literals, with the same meaning as unescaped p, P, and N.

Now that \N{} escape sequences are treated as characters rather than sets, they should be allowed in string literals; this means disallowing \N for N. For consistency and ease of lexing, \p and \P should be disallowed too.

Proposal:

  • Allow \N{} escapes in string literals, and disallow \N as an escaped N. [This part is already done in #3828, but was not mentioned in the proposal to treat \N as a character rather than a set.]

  • Disallow \p and \P in string literals.

In UTS61 terms, this means the following change:

string-element

bracketed-literal-element | escaped-element | named-element | \p | \P | \N

where the last three alternatives (\p, \P, and \N) are the ones being removed.

Breaking only in Java, legal-to-illegal: bracketed ranges but no string ranges

In ICU4J, [{a}]=[a], [{a}-{z}]=[a-z], and [{aa}-{cz}] is the set of all 78 strings starting with [a-c] and ending with [a-z]. The string ranges used to be specified in UTS #35, but have since been retracted.

In ICU4C, [{a}]=[a], and both [{a}-{z}]=[a-z] and [{aa}-{cz}] are ill-formed.

Proposal: Allow ranges of bracketed code points, such as [{a}-{z}] or [{a}-z]; disallow string ranges. Note that this is what ICU4X did.
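
The count of 78 and the effect of the proposal can be checked with a short Python sketch. It is a toy model, not ICU4J code: the function names are hypothetical, and the position-by-position expansion is only an assumption based on the description above ("starting with [a-c] and ending with [a-z]"), not the retracted UTS #35 rule itself.

```python
from itertools import product


def expand_string_range(first: str, last: str) -> list[str]:
    """Toy model of the old ICU4J string range {first}-{last}.

    Assumption: each position ranges independently between the
    corresponding code points of the two endpoints.
    """
    if len(first) != len(last):
        raise ValueError("endpoints must have the same length")
    per_position = []
    for lo, hi in zip(first, last):
        if ord(lo) > ord(hi):
            raise ValueError("range endpoints out of order")
        per_position.append([chr(c) for c in range(ord(lo), ord(hi) + 1)])
    return ["".join(chars) for chars in product(*per_position)]


def bracketed_range(first: str, last: str) -> list[str]:
    """Toy model of the proposal: both endpoints must be single code points."""
    if len(first) != 1 or len(last) != 1:
        raise ValueError("string ranges are not allowed")
    return [chr(c) for c in range(ord(first), ord(last) + 1)]


print(len(expand_string_range("aa", "cz")))  # old ICU4J reading of [{aa}-{cz}]
print(len(bracketed_range("a", "z")))        # [{a}-{z}] under the proposal
```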

Binary query negation with NOT EQUAL TO

PD UTS #61 (like UTS #35) allows ≠ as part of a property query, so that \p{Property≠Value} and [:Property≠Value:] are equivalent to \P{Property=Value} and [:^Property=Value:]. This is also supported by ICU4X.

Proposal: Allow ≠, but reject any doubly negated property-query, e.g., [:^Property≠Value:] or \P{Property≠Value}.
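
A minimal Python sketch of the negation rule follows. It is a toy model, not ICU code: query_is_negated is a hypothetical helper that only looks at the outer syntax and the presence of ≠, and does not validate property names or values.

```python
def query_is_negated(pattern: str) -> bool:
    """Return True if the query denotes the complement of Property=Value.

    Raises ValueError for anything that is not a property query and for
    doubly negated queries, which the proposal rejects.
    """
    if pattern.startswith("[:") and pattern.endswith(":]"):
        body = pattern[2:-2]
        outer_negation = body.startswith("^")
    elif pattern[:3] in (r"\p{", r"\P{") and pattern.endswith("}"):
        body = pattern[3:-1]
        outer_negation = pattern[1] == "P"
    else:
        raise ValueError("not a property query")
    inner_negation = "≠" in body
    if outer_negation and inner_negation:
        raise ValueError("doubly negated property query")
    return outer_negation or inner_negation


for p in [r"\p{Property=Value}", r"\p{Property≠Value}",
          r"\P{Property=Value}", "[:^Property=Value:]"]:
    print(p, query_is_negated(p))
```

Rejecting the doubly negated forms avoids having to decide whether \P{Property≠Value} means \p{Property=Value} or is a mistake.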

Extended name escapes

PD UTS61 proposes adding support for escapes that have both the hex code point and the name, e.g., \xN{0061:LATIN SMALL LETTER A}, or the hex code point, the literal character, and the name, \xlN{0061:a:LATIN SMALL LETTER A}. The need for that has become apparent in testing of properties for new characters, where we make heavy use of UnicodeSet. It parallels the well-established practice of citing characters by code point and name, or by code point and name while showing the glyph.

The use of the colon as a delimiter is inspired by its use as a delimiter for the version-qualifier in the more advanced corners of property-query syntax (which are out of scope for ICU).

Proposal: Accept hex:name and hex:literal:name escapes, both with the prefix \xN (no need for a separate \xlN prefix as currently proposed in PD UTS #61).

Rationale: \N{} has taken on a life of its own with C++23; extending it might lead to confusion, so let’s use \xN. But separate prefixes for \xN and \xlN don’t seem to solve any problem and don’t improve readability.
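
As an illustration of why a single prefix suffices, here is a hedged Python sketch that parses both forms. It is not ICU code and makes several assumptions that the proposal does not state: parse_xn is a hypothetical name, 1 to 6 hex digits are accepted, the name is compared exactly against Python's unicodedata (no loose matching or aliases), and a mismatch between the parts is treated as an error.

```python
import re
import unicodedata

# hex ":" [literal ":"] name; the optional middle part is one code point.
_XN = re.compile(r"\\xN\{([0-9A-Fa-f]{1,6}):(?:(.):)?([^:{}]+)\}", re.DOTALL)


def parse_xn(escape: str) -> str:
    """Return the character denoted by a \\xN{hex:name} or
    \\xN{hex:literal:name} escape, checking that the parts agree."""
    m = _XN.fullmatch(escape)
    if m is None:
        raise ValueError("not an \\xN escape")
    hex_digits, literal, name = m.groups()
    ch = chr(int(hex_digits, 16))
    if unicodedata.name(ch, None) != name:
        raise ValueError("name does not match code point")
    if literal is not None and literal != ch:
        raise ValueError("literal does not match code point")
    return ch


print(parse_xn(r"\xN{0061:LATIN SMALL LETTER A}"))
print(parse_xn(r"\xN{0061:a:LATIN SMALL LETTER A}"))
```

The two forms are distinguished by the number of colon-separated fields alone, which is the point of the rationale above.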

Name aliases and UAX44-LM2 in \N and \p{Name=…}

\N and \p{Name=…} (which are backed by the same implementation) currently ignore spaces and case, but not medial hyphens; they do not take name aliases into account. This means that \N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET} does not work (but \N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET} works), and that \N{Latin small ligature o-e} does not work (but \N{Latin small ligature o e} works).

Proposal: Take name aliases into account and implement UAX44-LM2 as specified, as described in PD UTS #61 (#Named-Elements-Semantics) and PD UTS #61 (#Valid-Values-and-Resolved-Sets).

This has actually already been accepted as ICU-8963 and ICU-3736.
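
The matching rule can be sketched in Python as follows. This is an illustrative model, not ICU's implementation: loose_key and resolve are hypothetical names, the two-entry name table is hand-written for the examples above rather than taken from the Unicode Character Database, and the U+1180 exception of UAX44-LM2 is handled by a simple special case.

```python
import re


def loose_key(name: str) -> str:
    """Fold a character name roughly as UAX44-LM2 describes: ignore case,
    white space, underscores, and medial hyphens (hyphens between two
    letters or digits), except the hyphen in HANGUL JUNGSEONG O-E."""
    upper = name.upper()
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    key = re.sub(r"[\s_]", "", no_medial)
    if key == "HANGULJUNGSEONGOE" and re.search(r"O-E\s*$", upper):
        return "HANGULJUNGSEONGO-E"  # keep U+1180 distinct from U+116C
    return key


# Tiny hand-written table for illustration only.
NAMES = {
    0x0153: "LATIN SMALL LIGATURE OE",
    0xFE18: "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET",
}
ALIASES = {
    0xFE18: ["PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"],
}

INDEX = {loose_key(n): cp for cp, n in NAMES.items()}
for cp, aliases in ALIASES.items():
    for alias in aliases:
        INDEX[loose_key(alias)] = cp


def resolve(name: str) -> int:
    """Look up a name or alias by its loose key; KeyError if unknown."""
    return INDEX[loose_key(name)]


print(hex(resolve("Latin small ligature o-e")))
print(hex(resolve("presentation form for vertical right white lenticular bracket")))
```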

More consistent toPattern

When a sufficiently simple UnicodeSet expression is parsed, its toPattern is normalized, e.g.,

UnicodeSet([ a-b d- qp-z ]).toPattern() is [abd-z], where for readability we have written UnicodeSet(𝑠) for C++ UnicodeSet("𝑠", errorCode).
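
For this simplest case, the normalization can be modelled in a few lines of Python. The sketch is a toy, not ICU's toPattern: normalize_simple_set is a hypothetical name, and it handles only single characters and ranges inside one pair of brackets, with no escapes, strings, properties, or nested sets.

```python
def normalize_simple_set(pattern: str) -> str:
    """Toy model: parse characters and ranges, merge them, print compactly."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a bracketed set")
    chars = [c for c in pattern[1:-1] if not c.isspace()]

    # Collect (start, end) code point ranges.
    ranges = []
    i = 0
    while i < len(chars):
        lo = hi = ord(chars[i])
        if i + 2 < len(chars) and chars[i + 1] == "-":
            hi = ord(chars[i + 2])
            i += 3
        else:
            i += 1
        if hi < lo:
            raise ValueError("range endpoints out of order")
        ranges.append((lo, hi))

    # Merge overlapping or adjacent ranges.
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))

    # One- and two-element ranges are written without a hyphen.
    out = []
    for lo, hi in merged:
        if hi == lo:
            out.append(chr(lo))
        elif hi == lo + 1:
            out.append(chr(lo) + chr(hi))
        else:
            out.append(f"{chr(lo)}-{chr(hi)}")
    return "[" + "".join(out) + "]"


print(normalize_simple_set("[ a-b d- qp-z ]"))
```

Here d-q and p-z overlap and are merged into d-z, while a-b is short enough to be written as ab.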

However, when the expression contains inner UnicodeSets, including property-queries, the entire syntactic structure of higher levels (but not of the bottommost level) is preserved, although some pretty-printing is performed:

UnicodeSet([ a-b [ccc] d- qp-z ]).toPattern()
is [a-b[c]d-qp-z],
UnicodeSet([[ a-b d- qp-z ]   & [: Let ter:]]).toPattern()
is [[a-bd-qp-z]&[: Let ter:]].

In addition, some escaping is performed even when unnecessary:
UnicodeSet([{Baden-Württemberg$}]).toPattern()
is [{Baden\-Württemberg\$}].

The intent here is that dependencies on properties should be preserved: calling toPattern on a set created from \p{XID_Continue} should yield a versionless reference to XID_Continue, not a set frozen at the current version of Unicode. However, there is no reason not to otherwise simplify the expression, and computing a string matching the exact input syntax while parsing just in case we need to preserve it requires considerable bookkeeping.

Proposal: Change the behaviour of toPattern to something more consistent, while retaining the property that for a string s, with s′ := UnicodeSet(s).toPattern(), UnicodeSet(s) == UnicodeSet(s′) independently of property values. For this purpose no properties are to be considered immutable: \p{ASCII} must not turn into [\x00-\x7F].

This is a bit of a blank cheque, because pretty-printing lies outside of the scope of the formal specification in UTS #61, and because it is not quite clear what will be easy and useful (for instance, converting the set arithmetic on property queries to disjunctive or conjunctive normal form might be counterproductive, but some application of De Morgan’s laws might be useful).



Markus Scherer

Dec 30, 2025, 1:45:35 PM
to Robin Leroy, icu-d...@unicode.org
On Tue, Dec 30, 2025 at 10:12 AM Robin Leroy <eggr...@unicode.org> wrote:
Following up on the changes approved by the TC on 2025-12-11, and continuing the effort to rigorously specify UnicodeSet syntax and have our implementation conform to that specification, I would like to propose the following changes to UnicodeSet behaviour

all lgtm tnx!
hny
markus

Mark Davis Ⓤ

Dec 30, 2025, 8:20:22 PM
to Markus Scherer, Robin Leroy, icu-d...@unicode.org
Those all look great.

Somewhat tangential to the goal of UTS #61 compatibility: as well as separating out the parsing of UnicodeSets, it would be useful to separate out the formatting of UnicodeSets. For escaping, I often find it useful to be able to have a completely flattened output, e.g., for [[a-z]\p{whitespace}], that doesn't preserve the property. You can do this manually, but I've seen people do it with uset.complement().complement(), which doesn't preserve strings. In CLDR, we've also had to develop a few UnicodeSet formatters to produce more readable formats.
