BiDi (and other) markers in signs (and other symbols) in CLDR data

Edward Welbourne

Sep 4, 2025, 11:44:18 AM
to 'Edward Welbourne' via CLDR - Users Public Mail List
Hi CLDR folks,

In the course of writing tests for Qt's number-parsing code, I took some
time to look at what variations we were seeing in the symbols used for
signs and separators (fractional part, a.k.a. "decimal point", grouping
and exponent). I'll here focus on the signs, but a review of the
various separators in numeric representations may also be in order.
There is a more detailed version of the first part of this story at [0].
[0] https://bugreports.qt.io/browse/QTBUG-139922

What I noticed is that - although a grep for minusSign and plusSign
elements in CLDR's common/main/*.xml found only matches whose visible
content was an ASCII dash (HYPHEN-MINUS) or U+2212 (MINUS SIGN) for
minusSign, and plain ASCII + (PLUS SIGN) in plusSign - several entries
(the ones I found were all in locales using right-to-left scripts)
contained invisible mark symbols along with that visible sign. These
included U+061C (ARABIC LETTER MARK), U+200E (LEFT-TO-RIGHT MARK) and
U+200F (RIGHT-TO-LEFT MARK). Some of the combinations seen look suspect
and I'm forced to consider the possibility that these invisibles are not
there because someone intended them to be but only because they couldn't
see them when copying and pasting what they thought was just the visible
part.
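
A minimal sketch, in Python, of the kind of check described above: walk a parsed locale file, find each minusSign and plusSign element, and spell out every code point in any that contain one of the three marks named here. The SAMPLE string is made up for illustration and is not copied from any CLDR file; the function names are likewise hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The three invisible marks named in this message.
BIDI_MARKS = {"\u061C", "\u200E", "\u200F"}

def describe(text):
    """Spell out each code point, e.g. 'U+200E LEFT-TO-RIGHT MARK'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

def signs_with_marks(root):
    """List (element name, code points) for each sign element containing a BiDi mark."""
    found = []
    for tag in ("minusSign", "plusSign"):
        for element in root.iter(tag):
            content = element.text or ""
            if any(ch in BIDI_MARKS for ch in content):
                found.append((tag, describe(content)))
    return found

# Illustrative input only (an assumption, not real CLDR data).
SAMPLE = (
    "<ldml><numbers><symbols numberSystem='arab'>"
    "<plusSign>\u061C+</plusSign><minusSign>\u061C-</minusSign>"
    "</symbols><symbols numberSystem='latn'>"
    "<plusSign>+</plusSign><minusSign>-</minusSign>"
    "</symbols></numbers></ldml>"
)

for tag, code_points in signs_with_marks(ET.fromstring(SAMPLE)):
    print(tag, code_points)
```

For a real file, ET.parse(path).getroot() would take the place of ET.fromstring(SAMPLE). Printing the code-point names is what makes the invisibles visible, which a plain grep does not.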

Apparently (from discussions while investigating this) Arabic also uses
ASCII dash as a hyphen between words; in that context it follows the
usual right-to-left flow of text - it is only when *used as* a sign that
it follows the left-to-right flow of the digits. This may account for
its tendency to show up in conjunction with a BiDi marker when copied
into a CLDR contribution.

It also occurs to me that, when parsing signs of numbers, at least in
locales that use a right-to-left script, it is probably prudent to
quietly ignore invisibles, whether in the CLDR-derived data or in the
text to be parsed, as the user typing the latter may well be unaware
that they're even typing them (their input method may be silently
inserting them where it guesses they may be needed; or they may be
copying and pasting from somewhere they can't see them) or simply unable
to type them (so requiring them would make number input impossible).
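
A rough sketch of that idea in Python (not Qt's actual code): strip the three marks from both the locale's sign symbols and the text being parsed, then compare only what remains. The names split_sign and strip_marks are hypothetical, and the sketch assumes the sign leads the number.

```python
# The three marks seen in the CLDR sign data discussed above.
BIDI_MARKS = "\u061C\u200E\u200F"
_DROP = {ord(ch): None for ch in BIDI_MARKS}

def strip_marks(text):
    """Remove ALM, LRM and RLM wherever they appear in text."""
    return text.translate(_DROP)

def split_sign(text, minus_sign, plus_sign):
    """Return (sign, rest): sign is -1, +1, or 0 when no leading sign is found.

    Marks are ignored both in the input text and in the locale's symbols,
    so neither side needs to contain (or omit) them for a match.
    """
    bare = strip_marks(text)
    for symbol, value in ((minus_sign, -1), (plus_sign, +1)):
        visible = strip_marks(symbol)
        if visible and bare.startswith(visible):
            return value, bare[len(visible):]
    return 0, bare

# Input typed with an RLM, locale symbol stored with an ALM: still a match.
print(split_sign("\u200F-123", "\u061C-", "\u061C+"))
```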

The LDML section on the number symbols [1] makes no mention of invisible
marks, while that on number parsing [2] advocates as much lenience as one
can safely entertain. The section on Number Patterns [3] does mention BiDi
characters and advocates ignoring them, citing the general LDML guidance
[4] on Lenient Parsing.

[1] https://www.unicode.org/reports/tr35/tr35-numbers.html#number-symbols
[2] https://www.unicode.org/reports/tr35/tr35-numbers.html#Parsing_Numbers
[3] https://www.unicode.org/reports/tr35/tr35-numbers.html#number-patterns
[4] https://www.unicode.org/reports/tr35/tr35.html#Lenient_Parsing

Is it worth, in CLDR itself, cleaning up the data so that extraneous
invisibles are not included in the common/main/*.xml data ?
For that matter, are the ones I found in fact extraneous ?
See the minusSign in ar.xml and ckb.xml for examples.

Should the LDML doc of [1] advise users to ignore BiDi markers in the
symbol fields, as [3] effectively does for the patterns ?

I gather that Arabic, while its text is right to left, writes its
numbers left to right, so presumably a L2R marker before the sign or
first digit would be required. Is there (I have failed to find it) any
guidance on when one *should* be inserting BiDi markers when formatting
numbers ?

I suppose I should likewise check for similar on grouping separators and
the separator between whole and fractional parts of a number. In the
case of the exponent separator, some locales have multi-character
sequences (typically denoting "times ten to the power", e.g. ×10^ in
Swedish usage; and Arabic locales seem to use something similar) and I
can imagine these (when they include letters) may need to be bracketed
in R2L and L2R, at least in some cases. I would hope the fields in CLDR
would take care of that (after all, they always appear between the
digits of mantissa and exponent) appropriately, but now I'm left
wondering whether I can rely on that.

Eddy.

Steven R. Loomis

Sep 4, 2025, 12:20:14 PM
to Edward Welbourne, 'Edward Welbourne' via CLDR - Users Public Mail List
I would say the marks are there on purpose, in fact they are now highlighted to the vetters when data is input, and also example formatted items are shown in various contexts (LTR, RTL, Neutral). See for example <https://st.unicode.org/cldr-apps/v#/ar/Number_Formatting_Patterns/53687a25c19b6481>

So the marks are used to improve display of formatted numbers in different contexts.

As you noted, the section on parsing notes: <https://www.unicode.org/reports/tr35/tr35.html#loose-matching>

> Ignore all format characters: in particular, ignore any RLM, LRM or ALM used to control BIDI formatting.

Steven

--
Steven R. Loomis
Code Hive Tx, LLC




Tim Partridge

Sep 4, 2025, 1:27:52 PM
to Steven R. Loomis, Edward Welbourne, 'Edward Welbourne' via CLDR - Users Public Mail List
My (possibly incorrect) understanding is that Arabic writers write numbers right to left. In Arabic the units are spoken first followed by higher powers of ten. However, to an English speaker reading printed Arabic, the result looks like what an English speaker would have written, starting with the highest power of ten and writing left to right.

Unicode of course has given the digits particular properties in the BiDi algorithm. 

As for decimal separators worldwide, for money in particular, the separator can be a currency symbol or letters.

Higher-value groupings of digits can be interesting too: in India, grouping starts with a group of three, then moves to groups of two, reflecting how the languages there express numbers (lakh and crore in English).
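
To make the grouping rule concrete, here is a small Python sketch. The function name and parameters are hypothetical: primary is the size of the rightmost group and secondary the size of every group to its left.

```python
def group_digits(digits, primary=3, secondary=2, separator=","):
    """Group a string of digits: one group of `primary` digits on the right,
    then groups of `secondary` digits moving leftwards."""
    if len(digits) <= primary:
        return digits
    head, tail = digits[:-primary], digits[-primary:]
    groups = [tail]
    while head:
        groups.append(head[-secondary:])
        head = head[:-secondary]
    return separator.join(reversed(groups))

print(group_digits("100000"))                  # one lakh
print(group_digits("10000000"))                # one crore
print(group_digits("10000000", secondary=3))   # same digits, groups of three throughout
```

With the defaults, one lakh comes out as 1,00,000 and one crore as 1,00,00,000; setting secondary=3 gives the grouping by threes familiar from English.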

Regards,

   Tim



Patrick CHEW

Sep 4, 2025, 1:54:11 PM
to Tim Partridge, Steven R. Loomis, Edward Welbourne, 'Edward Welbourne' via CLDR Users Public Mail List

On Sep 4, 2025, at 10:27, Tim Partridge <tim...@perdix.ndonet.com> wrote:

> My (possibly incorrect) understanding is that Arabic writers write numbers right to left. In Arabic the units are spoken first followed by higher powers of ten. However, to an English speaker reading printed Arabic, the result looks like what an English speaker would have written, starting with the highest power of ten and writing left to right.

When writing in Arabic (or Farsi, Urdu, Sindhi, Uyghur, etc.), numbers are written/read left-to-right; some might choose to write right-to-left, as described, to ensure spacing, but the folk I’ve watched all have written left-to-right. 🤷🏻‍♂️

> Regards,
>
>    Tim

Edward Welbourne

Sep 5, 2025, 3:46:57 AM
to Steven R. Loomis, 'Edward Welbourne' via CLDR - Users Public Mail List
Thanks, variously, for clarification (and, of course, if there are further
details worth sharing, I'll be all ears).

Steven R. Loomis (04 September 2025 18:20) promptly replied with:


> I would say the marks are there on purpose, in fact they are now
> highlighted to the vetters when data is input, and also example
> formatted items are shown in various contexts (LTR, RTL,
> Neutral). [snip]

ah ! Thanks for that - I'm not familiar with the contribution process,
being only a consumer of the data - so that means I should trust what's
there, when serialising numbers, I take it.

> So the marks are used to improve display of formatted numbers in
> different contexts.
>
> As you noted, the section on parsing notes: [snip]
>
>> Ignore all format characters: in particular, ignore any RLM, LRM or
>> ALM used to control BIDI formatting.

So I only need to change parsing of numbers (which I'm rewriting anyway
- that's how this all came to my attention), not how we glean our data
from CLDR or how we serialise using what that gave us.
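
That split can be sketched as follows, in Python rather than Qt's C++, with hypothetical function names. Formatting uses the CLDR symbol verbatim, while parsing drops every character of Unicode general category Cf (Format) from both sides before comparing. Note that Cf covers more than the three BiDi marks (it also includes, for example, U+200C and U+200D), so this is a deliberately broad reading of "ignore all format characters".

```python
import unicodedata

def strip_format_chars(text):
    """Drop every character whose Unicode general category is Cf (Format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def format_negative(digits, minus_sign):
    """Serialise with the locale's minus sign exactly as given, marks included."""
    return minus_sign + digits

def is_negative(text, minus_sign):
    """Parse leniently: compare only what is left once Cf characters are gone."""
    sign = strip_format_chars(minus_sign)
    return bool(sign) and strip_format_chars(text).startswith(sign)

minus = "\u061C-"  # a minus sign carrying an ALM, as an illustrative value
formatted = format_negative("42", minus)
print(is_negative(formatted, minus), is_negative("-42", minus), is_negative("42", minus))
```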

Eddy.
