Currently, [{T h i s s e t}] contains the 7-character string “Thisset”: spaces are ignored in string literals. This came as a surprise both to Mark Davis, who designed the support for string literals, and to Markus Scherer, who is probably the longest continuous maintainer of ICU’s UnicodeSet implementation.
The question was brought to the Properties and Algorithms group of the UTC, which was similarly surprised by the idea of a space-insensitive string literal, and pointed out that string literals don’t do that in any other language.
Proposal: Stop ignoring spaces in UnicodeSet string literals. [{T h i s s e t}] will now contain the 13-character string “T h i s s e t”.
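A minimal Python sketch of the difference, not ICU code: the two hypothetical helpers below take the text between the braces of a string literal and return the string it denotes under the current and the proposed rule. Escapes are ignored, and only U+0020 is handled; how other white space is treated is not covered here.

```python
def literal_value_current(body: str) -> str:
    """Current behaviour as described above: spaces inside the braces are dropped."""
    return body.replace(" ", "")


def literal_value_proposed(body: str) -> str:
    """Proposed behaviour: every space is part of the string."""
    return body


body = "T h i s s e t"
print(len(literal_value_current(body)), repr(literal_value_current(body)))
print(len(literal_value_proposed(body)), repr(literal_value_proposed(body)))
```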
Currently,
[:
XID continue:]
and
[:
^XID continue:]
and
[: ^XID continue:]
are valid, but
[:XID continue
:]
and
[:^
XID continue
:]
are ill-formed.
\p{
XID continue}
is also ill-formed. PD UTS61 treats [:^ as atomic. As in current ICU, spaces are not otherwise allowed within [: :] or \p{ }, except that U+0020 is ignored as part of name alias comparison.
Proposal: Disallow spaces (horizontal or vertical) within [:^; disallow vertical space or tabulations after [:. U+0020 SPACE remains allowed in the likes of [: X ID CONTINUE :] by ICU’s implementation of UAX44-LM3 (actually an earlier version of that rule; we don’t do “is”).
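To illustrate why [: X ID CONTINUE :] keeps working, here is a rough Python sketch of a loose comparison key. It assumes the older LM3-style rule amounts to ignoring case, U+0020, underscores, and hyphens, with no stripping of a leading “is”; the function names are hypothetical and this is not the ICU implementation.

```python
def loose_key(name: str) -> str:
    """Comparison key that ignores case, U+0020, '_' and '-'.

    Assumption: this approximates the older form of UAX44-LM3 that the text
    attributes to ICU. A leading "is" is NOT stripped, and no white space
    other than U+0020 is ignored.
    """
    return "".join(ch for ch in name.lower() if ch not in " _-")


def same_property(a: str, b: str) -> bool:
    """True if two property names are equal under the loose key."""
    return loose_key(a) == loose_key(b)


print(same_property(" X ID CONTINUE ", "XID_Continue"))
print(same_property("isXIDContinue", "XID_Continue"))
```

The syntactic restrictions in the proposal (nothing inside [:^, no vertical space or tabulation after [:) would be enforced before a name ever reaches such a comparison.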
Currently, \p{XID_Continue=} is equivalent to \p{XID_Continue} or \p{XID_Continue=Yes}. Even weirder, \p{Uppercase_Letter=} is equivalent to \p{Uppercase_Letter} or \p{General_Category=Uppercase_Letter}.
Proposal: disallow an equals sign on unary queries.
In ICU 78, \p, \P, and \N are disallowed in a set unless they are property queries or named characters, but they are allowed in string literals, with the same meaning as unescaped p, P, and N.
\N{} escape sequences now being treated as characters rather than sets, they should be allowed in string literals; this means disallowing \N for N. For consistency and ease of lexing, \p and \P should be disallowed too.
Proposal:
Allow \N{} escapes in string literals, and disallow \N as an escaped N. [This part is already done in #3828, but was not mentioned in the proposal to treat \N as a character rather than a set.]
Disallow \p and \P in string literals.
In UTS61 terms, this means the following change, where the trailing \p | \P | \N alternatives are the ones being removed:
string-element ⩴
bracketed-literal-element | escaped-element | named-element | \p | \P | \N
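The lexing consequence can be sketched in Python as follows. This is a simplified illustration under my own assumptions, not the ICU lexer: parse_string_literal is a hypothetical name, name lookup is delegated to Python’s unicodedata.lookup, and every escape other than \N{…} simply stands for the escaped character (so \x, \u and friends are not handled).

```python
import unicodedata


def parse_string_literal(body: str) -> str:
    r"""Return the string denoted by the text between { and }.

    Handles three kinds of element only: literal characters, \N{NAME}
    named elements, and a backslash followed by one escaped character.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        nxt = body[i + 1]
        if nxt in "pP":
            # Proposed rule: \p and \P are no longer escaped p and P.
            raise ValueError(f"\\{nxt} is not allowed in a string literal")
        if nxt == "N":
            # Proposed rule: \N must introduce a named element.
            if i + 2 >= len(body) or body[i + 2] != "{":
                raise ValueError("\\N must be followed by {NAME}")
            end = body.find("}", i + 3)
            if end == -1:
                raise ValueError("unterminated \\N{")
            # unicodedata.lookup raises KeyError for an unknown name.
            out.append(unicodedata.lookup(body[i + 3:end]))
            i = end + 1
            continue
        out.append(nxt)  # simplification: the escaped character itself
        i += 2
    return "".join(out)


print(parse_string_literal(r"a\N{LATIN SMALL LETTER B}c"))
```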
In ICU4J, [{a}]=[a], [{a}-{z}]=[a-z], and [{aa}-{cz}] is the set of all 78 strings starting with [a-c] and ending with [a-z]. The string ranges used to be specified in UTS #35, but have since been retracted.
In ICU4C, [{a}]=[a], but both [{a}-{z}] and [{aa}-{cz}] are ill-formed.
Proposal: Allow ranges of bracketed code points, such as [{a}-{z}] or [{a}-z]; disallow string ranges. Note that this is what ICU4X did.
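For concreteness, a small Python sketch of the two behaviours. The expansion in icu4j_string_range is my reading of the “78 strings” description above (each position ranges independently), not a transcription of ICU4J code; both function names are hypothetical.

```python
from itertools import product


def icu4j_string_range(first: str, last: str) -> list[str]:
    """Expand a string range position by position.

    Assumption: both endpoints have the same length and position i ranges
    independently from first[i] to last[i].
    """
    if len(first) != len(last):
        raise ValueError("string range endpoints must have equal length")
    columns = [
        [chr(cp) for cp in range(ord(a), ord(b) + 1)]
        for a, b in zip(first, last)
    ]
    return ["".join(chars) for chars in product(*columns)]


def proposed_range(first: str, last: str) -> list[str]:
    """Proposed rule: each endpoint is a single code point, bracketed or not."""
    if len(first) != 1 or len(last) != 1:
        raise ValueError("string ranges are disallowed")
    if ord(first) > ord(last):
        raise ValueError("range endpoints out of order")
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


print(len(icu4j_string_range("aa", "cz")))  # 3 first letters x 26 second letters
print(len(proposed_range("a", "z")))
```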
PD UTS #61 (like UTS #35) allows ≠ as part of a property query, so that \p{Property≠Value} and [:Property≠Value:] are equivalent to \P{Property=Value} and [:^Property=Value:]. This is also supported by ICU4X.
Proposal: Allow ≠, but reject any doubly negated property-query, e.g., [:^Property≠Value:] or \P{Property≠Value}.
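A sketch of the intended equivalences and of the double-negation check, in Python. It is deliberately simplified: no white space handling, no validation of property names or values, and parse_query is a hypothetical helper rather than anything in ICU.

```python
def parse_query(text: str):
    """Return (property, value, negated) for a simplified property query.

    value is None for a unary query. A query that is negated both by its
    delimiter (\\P{ or [:^) and by the ≠ operator is rejected.
    """
    if text.startswith("[:") and text.endswith(":]"):
        body = text[2:-2]
        outer_negated = body.startswith("^")
        if outer_negated:
            body = body[1:]
    elif text[:3] in ("\\p{", "\\P{") and text.endswith("}"):
        outer_negated = text[1] == "P"
        body = text[3:-1]
    else:
        raise ValueError("not a property query")

    if "≠" in body:
        if outer_negated:
            raise ValueError("doubly negated property query")
        prop, value = body.split("≠", 1)
        return prop, value, True
    if "=" in body:
        prop, value = body.split("=", 1)
        return prop, value, outer_negated
    return body, None, outer_negated  # unary query


for query in (r"\p{Script≠Latin}", r"\P{Script=Latin}",
              "[:Script≠Latin:]", "[:^Script=Latin:]"):
    print(query, parse_query(query))
```

All four queries in the loop resolve to the same triple, which is the equivalence stated above.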
PD UTS61 proposes adding support for escapes that have both the hex code point and the name, e.g., \xN{0061:LATIN SMALL LETTER A}, or the hex code point, the literal character, and the name, \xlN{0061:a:LATIN SMALL LETTER A}. The need for that has become apparent in testing of properties for new characters, where we make heavy use of UnicodeSet. It parallels the well-established practice of citing characters by code point and name, or by code point and name while showing the glyph.
The use of the colon as a delimiter is inspired by its use as a delimiter for the version-qualifier in the more advanced corners of property-query syntax (which are out of scope for ICU).
Proposal: Accept hex:name and hex:literal:name escapes, both with the prefix \xN (no need for a separate \xlN prefix as currently proposed in PD UTS #61).
Rationale: \N{} has taken on a life of its own, with C++23; extending it might lead to confusion, so let’s use \xN. But separate prefixes for \xN and \xlN don’t seem to be solving any problem and don’t improve readability.
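A possible shape for parsing the single-prefix form, sketched in Python. Two things here are my assumptions rather than anything stated above: that a disagreement between the fields is an error (which seems to be the point of the redundancy), and that the literal field is whatever lies between the first and the last colon, so that a colon can itself be the literal. parse_xn_escape is a hypothetical name, and names are resolved with Python’s unicodedata.lookup.

```python
import unicodedata

HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_xn_escape(text: str) -> str:
    r"""Parse \xN{hex:name} or \xN{hex:literal:name} and return the character."""
    prefix = "\\xN{"
    if not (text.startswith(prefix) and text.endswith("}")):
        raise ValueError("not an \\xN escape")
    inner = text[len(prefix):-1]

    hex_part, colon, rest = inner.partition(":")
    if not colon or not hex_part or not set(hex_part) <= HEX_DIGITS:
        raise ValueError("expected hex digits followed by ':'")
    char = chr(int(hex_part, 16))

    # Character names never contain ':', so the name follows the last colon.
    literal, colon, name = rest.rpartition(":")
    has_literal = colon == ":"

    try:
        named = unicodedata.lookup(name)
    except KeyError:
        raise ValueError("unknown character name: " + name) from None
    if named != char:
        raise ValueError("code point and name disagree")  # assumed to be an error
    if has_literal and literal != char:
        raise ValueError("code point and literal disagree")  # assumed to be an error
    return char


print(parse_xn_escape(r"\xN{0061:LATIN SMALL LETTER A}"))
print(parse_xn_escape(r"\xN{0061:a:LATIN SMALL LETTER A}"))
```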
\N and \p{Name=…} (which are backed by the same implementation) currently ignore spaces and case, but not medial hyphens; they do not take name aliases into account. This means that \N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET} does not work (but \N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET} works), and that \N{Latin small ligature o-e} does not work (but \N{Latin small ligature o e} works).
Proposal: Take name aliases into account and implement UAX44-LM2 as specified, as described in PD UTS #61 (#Named-Elements-Semantics) and PD UTS #61 (#Valid-Values-and-Resolved-Sets).
This has actually already been accepted as ICU-8963 and ICU-3736.
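As an illustration of what UAX44-LM2 asks for (ignore case, white space, underscores, and medial hyphens, except the hyphen in U+1180 HANGUL JUNGSEONG O-E), here is a Python sketch with a three-entry table that includes the corrected alias for U+FE18. The helper names and the tiny table are hypothetical, and the handling of the U+1180 exception is approximate.

```python
import re


def lm2_key(name: str) -> str:
    """Loose-matching key in the spirit of UAX44-LM2 (approximate sketch)."""
    upper = name.upper()
    squeezed = re.sub(r"[\s_]", "", upper)
    # Exception: the hyphen of U+1180 HANGUL JUNGSEONG O-E is significant,
    # since U+116C is HANGUL JUNGSEONG OE.
    if squeezed == "HANGULJUNGSEONGO-E":
        return squeezed
    # A medial hyphen has a letter or digit on both sides; drop those first,
    # while spaces are still present, so that " -PHRU" keeps its hyphen.
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    return re.sub(r"[\s_]", "", no_medial)


# Hypothetical miniature name table; a real one would be built from the
# character names and formal name aliases of the Unicode Character Database.
NAMES = {
    lm2_key("LATIN SMALL LIGATURE OE"): "\u0153",
    lm2_key("PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET"): "\uFE18",
    # Name alias correcting the misspelt name above.
    lm2_key("PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET"): "\uFE18",
}


def lookup(name: str) -> str:
    """Look a name up in the miniature table; raises KeyError if absent."""
    return NAMES[lm2_key(name)]


print(lookup("Latin small ligature o-e") == lookup("Latin small ligature o e"))
```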
When a sufficiently simple UnicodeSet expression is parsed, its toPattern is normalized, e.g.,
UnicodeSet([ a-b d- qp-z ]).toPattern() is [abd-z], where for readability we have written UnicodeSet(𝑠) for C++ UnicodeSet("𝑠", errorCode).
However, when the expression contains inner UnicodeSets, including property-queries, the entire syntactic structure of higher levels (but not of the bottommost level) is preserved, although some pretty-printing is performed:
UnicodeSet([ a-b [ccc] d- qp-z ]).toPattern()
is [a-b[c]d-qp-z],
UnicodeSet([[ a-b d- qp-z ] & [: Let ter:]]).toPattern()
is [[a-bd-qp-z]&[: Let ter:]].
In addition, some escaping is performed even when unnecessary:
UnicodeSet([{Baden-Württemberg$}]).toPattern()
is [{Baden\-Württemberg\$}].
The intent here is that dependencies on properties should be preserved: calling toPattern on a set created from \p{XID_Continue} should yield a versionless reference to XID_Continue, not a set frozen at the current version of Unicode. However, there is no reason not to otherwise simplify the expression, and computing a string matching the exact input syntax while parsing just in case we need to preserve it requires considerable bookkeeping.
Proposal: Change the behaviour of toPattern to something more consistent, while retaining the property that for a string s, with s′ := UnicodeSet(s).toPattern(), UnicodeSet(s) == UnicodeSet(s′) independently of property values. For this purpose no properties are to be considered immutable: \p{ASCII} must not turn into [\x00-\x7F].
This is a bit of a blank cheque, because pretty-printing lies outside of the scope of the formal specification in UTS #61, and because it is not quite clear what will be easy and useful (for instance, converting the set arithmetic on property queries to disjunctive or conjunctive normal form might be counterproductive, but some application of De Morgan’s laws might be useful).
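For reference, the normalization that already happens for simple sets, as in the [ a-b d- qp-z ] example at the start of this section, amounts to merging ranges and printing them compactly. The Python sketch below reproduces that one example; it is not ICU’s algorithm, it does no escaping, and printing a two-element range without a hyphen is inferred from the [abd-z] output quoted above.

```python
def normalize_ranges(ranges):
    """Merge overlapping or adjacent (first, last) code point ranges."""
    merged = []
    for first, last in sorted(ranges):
        if merged and first <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], last)
        else:
            merged.append([first, last])
    return merged


def to_pattern(ranges) -> str:
    """Print merged ranges as a bracketed pattern, without any escaping."""
    parts = []
    for first, last in normalize_ranges(ranges):
        if first == last:
            parts.append(chr(first))
        elif last == first + 1:
            parts.append(chr(first) + chr(last))  # e.g. a-b is printed as ab
        else:
            parts.append(f"{chr(first)}-{chr(last)}")
    return "[" + "".join(parts) + "]"


# The ranges a-b, d-q and p-z from the example above.
simple = [(ord("a"), ord("b")), (ord("d"), ord("q")), (ord("p"), ord("z"))]
print(to_pattern(simple))
```

The open question in the proposal is how far this kind of simplification should extend once property queries, which must stay symbolic, appear inside the expression.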
Following up on the changes approved by the TC on 2025-12-11, and continuing the effort to rigorously specify UnicodeSet syntax and have our implementation conform to that specification, I would like to propose the following changes to UnicodeSet behaviour.
--
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-design/CAN49p6rSXs2sb9awbjU0%2BpTgvCnXLZKYfJhhmwaUWJ0sS7fPHg%40mail.gmail.com.