In the course of writing tests for Qt's number-parsing code, I took some
time to look at what variations we were seeing in the symbols used for
signs and separators (fractional part, a.k.a. "decimal point", grouping
and exponent). I'll here focus on the signs, but a review of the
various separators in numeric representations may also be in order.
There is a more detailed version of the first part of this story at [0].
[0] https://bugreports.qt.io/browse/QTBUG-139922
What I noticed is that - although a grep for minusSign and plusSign
elements in CLDR's common/main/*.xml found only matches whose visible
content was an ASCII dash (HYPHEN-MINUS) or U+2212 (MINUS SIGN) for
minusSign, and plain ASCII + (PLUS SIGN) in plusSign - several entries
(the ones I found were all in locales using right-to-left scripts)
contained invisible mark symbols along with that visible sign. These
included U+061C (ARABIC LETTER MARK), U+200E (LEFT-TO-RIGHT MARK) and
U+200F (RIGHT-TO-LEFT MARK). Some of the combinations seen look suspect
and I'm forced to consider the possibility that these invisibles are not
there because someone intended them to be but only because they couldn't
see them when copying and pasting what they thought was just the visible
part.
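For anyone wanting to repeat the search, here is a minimal sketch of how one might list the invisibles in the sign fields, since grep output won't show them. It assumes that "invisible" can be approximated by Unicode general category Cf (format characters), which covers U+061C, U+200E and U+200F. The helper names and the SAMPLE text are made up for illustration; SAMPLE is not copied from CLDR.

```python
import unicodedata
import xml.etree.ElementTree as ET

def invisible_marks(text):
    """List (code point, name) for each format character (category Cf) in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text if unicodedata.category(ch) == "Cf"]

def scan_symbols(xml_text, tags=("minusSign", "plusSign")):
    """Map (numberSystem, tag) to the format characters found in that element."""
    root = ET.fromstring(xml_text)
    found = {}
    for symbols in root.iter("symbols"):
        system = symbols.get("numberSystem", "")
        for tag in tags:
            for elem in symbols.findall(tag):
                marks = invisible_marks(elem.text or "")
                if marks:
                    found[(system, tag)] = marks
    return found

# Hypothetical stand-in for a locale file, NOT real CLDR data.
SAMPLE = """<ldml><numbers>
  <symbols numberSystem="arab"><minusSign>\u061c-</minusSign><plusSign>\u061c+</plusSign></symbols>
  <symbols numberSystem="latn"><minusSign>\u200e-</minusSign><plusSign>+</plusSign></symbols>
</numbers></ldml>"""

for key, marks in scan_symbols(SAMPLE).items():
    print(key, marks)
```

To run it over a real checkout one would read each common/main/*.xml file and pass its text to scan_symbols(); the tags parameter can name other symbol elements too.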
Apparently (from discussions while investigating this) Arabic also uses
ASCII dash as a hyphen between words; in that context it follows the
usual right-to-left flow of text - it is only when *used as* a sign that
it follows the left-to-right flow of the digits. This may account for
its tendency to show up in conjunction with a BiDi marker when copied
into a CLDR contribution.
It also occurs to me that, when parsing signs of numbers, at least in
locales that use a right-to-left script, it is probably prudent to
quietly ignore invisibles, whether in the CLDR-derived data or in the
text to be parsed, as the user typing the latter may well be unaware
that they're even typing them (their input method may be silently
inserting them where it guesses they may be needed; or they may be
copying and pasting from somewhere they can't see them) or simply unable
to type them (so requiring them would make number input impossible).
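To illustrate what I mean by quietly ignoring invisibles, here is a rough sketch, not Qt's actual code. It assumes that dropping every character of general category Cf is acceptable for number input, and it strips them both from the locale's sign symbols and from the text being parsed. The function names and default sign lists are hypothetical.

```python
import unicodedata

def strip_format_chars(text):
    """Drop all format characters (category Cf), e.g. ALM, LRM and RLM."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def split_sign(text, minus_signs=("-", "\u2212"), plus_signs=("+",)):
    """Return (sign, rest): sign is -1 or +1, rest is what follows the sign.

    The sign symbols may themselves carry invisibles (as CLDR-derived
    data can); these are stripped before comparison.
    """
    cleaned = strip_format_chars(text).strip()
    for sign, symbols in ((-1, minus_signs), (1, plus_signs)):
        for sym in symbols:
            sym = strip_format_chars(sym)
            if sym and cleaned.startswith(sym):
                return sign, cleaned[len(sym):]
    return 1, cleaned  # no explicit sign: treat as positive

# Input with RLM before the sign and LRM after it still parses as negative.
print(split_sign("\u200f-\u200e42"))
```

Stripping on both sides means the user is never required to type a mark they can't see, and a mark they typed by accident does no harm.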
The LDML section on the number symbols [1] makes no mention of invisible
marks, while that on number parsing [2] advocates as much lenience as one
can safely entertain. The section on Number Patterns [3] does mention BiDi
characters and advocates ignoring them, citing the general LDML guidance
[4] on Lenient Parsing.
[1] https://www.unicode.org/reports/tr35/tr35-numbers.html#number-symbols
[2] https://www.unicode.org/reports/tr35/tr35-numbers.html#Parsing_Numbers
[3] https://www.unicode.org/reports/tr35/tr35-numbers.html#number-patterns
[4] https://www.unicode.org/reports/tr35/tr35.html#Lenient_Parsing
Is it worth, in CLDR itself, cleaning up the data so that extraneous
invisibles are not included in the common/main/*.xml data ?
For that matter, are the ones I found in fact extraneous ?
See the minusSign in ar.xml and ckb.xml for examples.
Should the LDML doc of [1] advise users to ignore BiDi markers in the
symbol fields, as [3] effectively does for the patterns ?
I gather that Arabic, while its text is right to left, writes its
numbers left to right, so presumably an L2R marker before the sign or
first digit would be required. Is there (I have failed to find it) any
guidance on when one *should* be inserting BiDi markers when formatting
numbers ?
I suppose I should likewise check for similar on grouping separators and
the separator between whole and fractional parts of a number. In the
case of the exponent separator, some locales have multi-character
sequences (typically denoting "times ten to the power", e.g. ×10^ in
Swedish usage; and Arabic locales seem to use something similar) and I
can imagine these (when they include letters) may need to be bracketed
in R2L and L2R, at least in some cases. I would hope the fields in CLDR
would take care of that (after all, they always appear between the
digits of mantissa and exponent) appropriately, but now I'm left
wondering whether I can rely on that.
Eddy.
On Sep 4, 2025, at 10:27, Tim Partridge <tim...@perdix.ndonet.com> wrote:
My (possibly incorrect) understanding is that Arabic writers write numbers right to left. In Arabic the units are spoken first, followed by higher powers of ten. However, to an English speaker reading printed Arabic, the result looks like what an English speaker would have written: starting with the highest power of ten and proceeding left to right.
Regards,
Tim
Steven R. Loomis (04 September 2025 18:20) promptly replied with:
> I would say the marks are there on purpose, in fact they are now
> highlighted to the vetters when data is input, and also example
> formatted items are shown in various contexts (LTR, RTL,
> Neutral). [snip]
ah ! Thanks for that - I'm not familiar with the contribution process,
being only a consumer of the data - so that means I should trust what's
there, when serialising numbers, I take it.
> So the marks are used to improve display of formatted numbers in
> different contexts.
>
> As you noted, the section on parsing notes: [snip]
>> Ignore all format characters: in particular, ignore any RLM, LRM or
>> ALM used to control BIDI formatting.
So I only need to change parsing of numbers (which I'm rewriting anyway
- that's how this all came to my attention), not how we glean our data
from CLDR or how we serialise using what that gave us.
Eddy.