A few general questions re CLDR and UCD


Jens Troeger

Jan 23, 2024, 11:52:12 AM
to CLDR Users Public Mail List
Hello,

I spent some time poking through the CLDR 44.1 files (link) and the UCD 15.1.0 files (link), and I have a few questions. Hopefully, this is the right place to post them… 🤞🏼

• Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org here.

• The supplementalData file (link), element languageData seems to contain the scripts used for languages, based on the language’s ISO639 code (both 2 and 3 letters). The script names seem to be “short” versions though (e.g. “Latn” for “Latin”) and the scripts section sc in the UCD’s PropertyValueAliases file (link) can be used to find the “long” version of the script name?

• Following the previous point, the short script names “Jpan” (link) and “Kore” (link) and “Hans” (link) don’t have long script names in the PropertyValueAliases file? How come?

• What is the recommended pathway to find a script name(s) for a given language code, other than the above?

Much thanks!
Jens

Asmus Freytag

Jan 23, 2024, 1:10:57 PM
to cldr-...@unicode.org
On 1/22/2024 3:11 PM, Jens Troeger wrote:
• ... the short script names “Jpan” (link) and “Kore” (link) and “Hans” (link) don’t have long script names in the PropertyValueAliases file? How come?

There are two definitions of "script".

One classification is useful for libraries. It gives multi-script writing systems a single code. It may also treat a typographic variant as its own script. (cf. Latf).

The other is the classification used by Unicode, where a script is defined by its repertoire, with no repertoires overlapping. (Shared use of characters is handled with the Script_Extensions property).

The PropertyValueAliases are properly restricted to values that can be assigned to a given property. "Scripts" that do not fit the definition of "Script" as that property is defined in the Unicode Standard are thus not covered and you must look at some other place for information about them.

A./
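
For readers following along, here is a minimal sketch of the short-to-long lookup discussed above. It assumes the `property ; short ; long` line layout of PropertyValueAliases.txt; the `SAMPLE` text is a hand-picked illustrative excerpt rather than the real file, and `script_aliases` is a hypothetical helper name.

```python
# Illustrative excerpt in the PropertyValueAliases.txt line layout
# (property ; short alias ; long alias). This is NOT the full file.
SAMPLE = """\
# Script (sc)
sc ; Arab                             ; Arabic
sc ; Hani                             ; Han
sc ; Latn                             ; Latin
gc ; Lu                               ; Uppercase_Letter
"""


def script_aliases(text):
    """Map short Script property values (e.g. 'Arab') to long names."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Keep only rows for the Script property ("sc").
        if fields[0] == "sc" and len(fields) >= 3:
            aliases[fields[1]] = fields[2]
    return aliases


ALIASES = script_aliases(SAMPLE)
print(ALIASES.get("Arab"))  # long name found in the excerpt
print(ALIASES.get("Jpan"))  # no entry in the excerpt, so None
```

A code that is not a value of the Script property simply has no `sc` row, so the lookup returns `None` and the name has to come from another source.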


Asmus Freytag

Jan 23, 2024, 1:15:30 PM
to cldr-...@unicode.org
On 1/22/2024 3:11 PM, Jens Troeger wrote:
• Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org here.

The UCD precedes GitHub and is published as a collection of files on the Unicode server. For consistency and stability this is unlikely to change.

The sources are (non-publicly) managed on GitHub, but the published files are the "truth".

A./

Edward Welbourne

Jan 24, 2024, 5:42:48 AM
to jens.t...@gmail.com, cldr-...@unicode.org
Jens Troeger (Tue 23/01/2024 17:56) wrote:
> I spent some time poking through the CLDR 44.1 files (...) and the
> UCD 15.1.0 files (...), and I have a few questions. Hopefully, this
> is the right place to post them… 🤞🏼

It's where I'd have asked ;^>

> • Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org
> here.

https://www.unicode.org/ucd/
has the links you probably want.

> • The supplementalData file (...), element languageData seems to
> contain the scripts used for languages, based on the language’s
> ISO639 code (both 2 and 3 letters). The script names seem to be
> “short” versions though (e.g. “Latn” for “Latin”) and the scripts
> section sc in the UCD’s PropertyValueAliases file (...) can be used
> to find the “long” version of the script name?
>
> • Following the previous point, the short script names “Jpan” (...)
> and “Kore” (...) and “Hans” (...) don’t have long script names in
> the PropertyValueAliases file? How come?
>
> • What is the recommended pathway to find a script name(s) for a given
> language code, other than the above?

I just look in common/main/en.xml's localeDisplayNames/scripts/script
elements:

<script type="Hans">Simplified</script>
<script type="Hans" alt="stand-alone">Simplified Han</script>
<script type="Jpan">Japanese</script>
<script type="Kore">Korean</script>

You can surely do the same for any other locale's appropriate name for a
script.

Eddy.
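
A minimal sketch of that lookup, assuming the `script` elements sit under `ldml/localeDisplayNames/scripts` as described above. The `SAMPLE_EN_XML` string is a tiny stand-in built from the four lines quoted in this post, not the real en.xml, and `script_display_names` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for common/main/en.xml, built from the lines quoted above.
SAMPLE_EN_XML = """\
<ldml>
  <localeDisplayNames>
    <scripts>
      <script type="Hans">Simplified</script>
      <script type="Hans" alt="stand-alone">Simplified Han</script>
      <script type="Jpan">Japanese</script>
      <script type="Kore">Korean</script>
    </scripts>
  </localeDisplayNames>
</ldml>
"""


def script_display_names(xml_text, prefer_alt=None):
    """Return {script code: display name} for one locale file.

    Elements without an alt attribute are the default names; if
    prefer_alt is given (e.g. "stand-alone"), a matching alt variant
    replaces the default for that code.
    """
    root = ET.fromstring(xml_text)
    names = {}
    for el in root.findall("./localeDisplayNames/scripts/script"):
        code, alt = el.get("type"), el.get("alt")
        if alt is None:
            names.setdefault(code, el.text)
        elif alt == prefer_alt:
            names[code] = el.text
    return names


default_names = script_display_names(SAMPLE_EN_XML)
standalone_names = script_display_names(SAMPLE_EN_XML, prefer_alt="stand-alone")
print(default_names["Jpan"], "/", standalone_names["Hans"])
```

Passing a different locale's file to the same function would give that locale's names for the same script codes.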

Peter Constable

Jan 26, 2024, 11:46:19 AM
to Asmus Freytag, cldr-...@unicode.org

Adding to what Asmus said: CLDR and Unicode use IDs compatible with ISO 15924. That standard was developed by two different ISO committees, one of which develops standards used by libraries. Unicode is the organization designated to maintain the ISO 15924 code. You can find _those_ data files here:

ISO 15924 Registration Authority (unicode.org)

Peter

--
You received this message because you are subscribed to the Google Groups "CLDR Users Public Mail List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cldr-users+...@unicode.org.
To view this discussion on the web visit https://groups.google.com/a/unicode.org/d/msgid/cldr-users/bc388ca5-487e-40ab-8107-7b703131d1a5%40ix.netcom.com.

Jens Troeger

Jan 26, 2024, 11:46:19 AM
to CLDR Users Public Mail List, edward.w...@qt.io, cldr-...@unicode.org, Jens Troeger
Thank you Asmus and Edward, that helped!

The problem I want to solve: given an ISO639 language code (e.g. “fa”) find the range of Unicode code points covering the Farsi script. So what I have now is: go into supplementalData.xml and for “fa” find “Arab” here, then go into PropertyValueAliases.txt and for “Arab” find “Arabic” here, then go into Scripts and gather together the code point ranges for “Arabic” here.

Following Edward’s suggestion I could also go from “Arab” to “Arabic” via en.xml here, and en.xml would also supply the other long script names I mentioned above. (Might need to be careful with underscore/dash and spaces, as they differ between en.xml and Scripts.txt.)

Asmus, I think I need to do some more homework to understand your last paragraph regarding restricted script properties.

With many greetings,
Jens
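
The three-step lookup described in this post could be sketched roughly as follows. Everything here runs on small inline samples: `SUPPLEMENTAL` is a simplified stand-in for the `languageData` element (the real entries carry more attributes), `ALIASES` stands in for the `sc` rows of PropertyValueAliases.txt, and `SCRIPTS_TXT` is a three-line excerpt in the Scripts.txt layout. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for supplementalData.xml (assumed attribute layout).
SUPPLEMENTAL = """\
<supplementalData>
  <languageData>
    <language type="fa" scripts="Arab"/>
    <language type="sr" scripts="Cyrl Latn"/>
  </languageData>
</supplementalData>
"""

# Stand-in for the "sc" rows of PropertyValueAliases.txt.
ALIASES = {"Arab": "Arabic", "Cyrl": "Cyrillic", "Latn": "Latin"}

# Three-line excerpt in the Scripts.txt layout, not the full file.
SCRIPTS_TXT = """\
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0600..0604    ; Arabic # Cf   [5] ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT
060B          ; Arabic # Sc       AFGHANI SIGN
"""


def scripts_for_language(xml_text, lang):
    """Step 1: language code -> list of short script codes."""
    root = ET.fromstring(xml_text)
    codes = []
    for el in root.iterfind("./languageData/language"):
        if el.get("type") == lang:
            for code in el.get("scripts", "").split():
                if code not in codes:
                    codes.append(code)
    return codes


def ranges_for_script(scripts_text, long_name):
    """Step 3: long script name -> list of (first, last) code points."""
    ranges = []
    for line in scripts_text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        cps, name = (field.strip() for field in data.split(";"))
        if name != long_name:
            continue
        first, _, last = cps.partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def ranges_for_language(lang):
    """Chain steps 1-3; step 2 is the ALIASES lookup."""
    return {
        ALIASES[code]: ranges_for_script(SCRIPTS_TXT, ALIASES[code])
        for code in scripts_for_language(SUPPLEMENTAL, lang)
        if code in ALIASES  # codes without a long name are skipped
    }


print(ranges_for_language("fa"))
```

The result is keyed by script name because a language entry can list more than one script, as the `sr` sample row does. The ranges cover the whole script as the samples define it, not only the characters a particular language uses.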

Markus Scherer

Jan 26, 2024, 11:46:20 AM
to Jens Troeger, CLDR Users Public Mail List
On Tue, Jan 23, 2024 at 8:52 AM Jens Troeger <jens.t...@gmail.com> wrote:
I spent some time poking through the CLDR 44.1 files (link) and the UCD 15.1.0 files (link), and I have a few questions. Hopefully, this is the right place to post them… 🤞🏼

• Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org here.


• The supplementalData file (link), element languageData seems to contain the scripts used for languages, based on the language’s ISO639 code (both 2 and 3 letters). The script names seem to be “short” versions though (e.g. “Latn” for “Latin”) and the scripts section sc in the UCD’s PropertyValueAliases file (link) can be used to find the “long” version of the script name?

Yes. The script codes are actually ISO 15924 script codes, from https://www.unicode.org/iso15924/ but the reference for BCP 47 language tags is here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
Unicode CLDR also defines meanings for a couple of private use codes, see the LDML spec.

• Following the previous point, the short script names “Jpan” (link) and “Kore” (link) and “Hans” (link) don’t have long script names in the PropertyValueAliases file? How come?

The script codes in the UCD are a subset. They are used to say which character has which Script or Script_Extensions property value. UCD Scripts.txt does not list codes for unencoded scripts or aliases/subsets/supersets.
You can find all of them in the ISO 15924 and BCP 47 lists.

• What is the recommended pathway to find a script name(s) for a given language code, other than the above?

If you mean a long property value name, then there are none defined for scripts that are not used in the UCD.

If you mean display names, then you can find them in the per-locale CLDR data files, according to the LDML spec.

Most users get display names via libraries like ICU, ICU4X, JDK, etc.

Viele Grüße,
markus

Jens Troeger

Jan 28, 2024, 6:50:58 AM
to CLDR Users Public Mail List, marku...@gmail.com, CLDR Users Public Mail List, Jens Troeger
Peter, thank you for the link!

Markus, danke Dir 😊 I think I have to do some more reading — the sheer amount of information and its structure is a bit overwhelming. To answer your question:

> If you mean a long property value name, then there are none defined for scripts that are not used in the UCD.

…and for those scripts that _are_ used in the UCD? I’m beginning to wonder if my initial question might be a bit too naive due to the complexity of information encoded by the standards 🤔

Much thanks!
Jens

Peter Constable

Jan 29, 2024, 1:15:40 AM
to Jens Troeger, CLDR Users Public Mail List, marku...@gmail.com, CLDR Users Public Mail List, Jens Troeger

Jens,

Earlier in this thread you wrote:

> The problem I want to solve: given an ISO639 language code (e.g. “fa”) find the range of Unicode code points covering the Farsi script.

This assumes that, for a given language, there is a clear and unique mapping to a script. In fact, any language can potentially be written with many different scripts. Now, in practice there are many cases in which a particular language is _commonly_ written only with one script. But there are many languages for which writing in more than one script is not _un_common. Also, even when only one script is used in general contexts there can be special contexts in which the same language is written using another script. So, for the general case, one needs to map from a language identifier to a set of one or more scripts that may be likely to be used in particular contexts.

Once you get from a particular language to a language-script combination, there’s a subsequent possibility that there could be more than one _orthography_ or _writing system_ using that script for that language. E.g., Oxford spelling of English and IPA transcription of English both use Latin script, but the spellings and repertoire of Latin characters used are very different. Even aside from technical transcription, some languages can have multiple orthographies for common, non-technical usage, and the different orthographies may involve different characters. Again, in the general case, to get from a language ID to a set of Unicode code points, it may be necessary to select from multiple relevant choices that can affect which set of characters is used.

Peter


From: Jens Troeger <jens.t...@gmail.com>
Date: Sunday, January 28, 2024 at 3:51 AM
To: CLDR Users Public Mail List <cldr-...@unicode.org>
Cc: marku...@gmail.com <marku...@gmail.com>, CLDR Users Public Mail List <cldr-...@unicode.org>, Jens Troeger <jens.t...@gmail.com>
Subject: Re: A few general questions re CLDR and UCD


Markus Scherer

Jan 29, 2024, 1:49:20 AM
to Jens Troeger, CLDR Users Public Mail List
On Sun, Jan 28, 2024 at 3:50 AM Jens Troeger <jens.t...@gmail.com> wrote:
> If you mean a long property value name, then there are none defined for scripts that are not used in the UCD.

…and for those scripts that _are_ used in the UCD?

PropertyValueAliases.txt, as you found.
markus

Markus Scherer

Jan 29, 2024, 1:55:10 AM
to Peter Constable, Jens Troeger, CLDR Users Public Mail List
On Sun, Jan 28, 2024 at 10:15 PM Peter Constable <pgc...@msn.com> wrote:

Earlier in this thread you wrote:

> The problem I want to solve: given an ISO639 language code (e.g. “fa”) find the range of Unicode code points covering the Farsi script.

This assumes that, for a given language, there is a clear and unique mapping to a script. ...


And maybe for Jens it would be useful to look at the CLDR data for exemplar characters per locale (including, for some languages, multiple locales for different scripts).

markus

Mark Davis Ⓤ

Jan 29, 2024, 5:11:48 PM
to Markus Scherer, Peter Constable, Jens Troeger, CLDR Users Public Mail List


Lorna Evans

Jan 30, 2024, 11:18:58 AM
to Jens Troeger, CLDR Users Public Mail List, edward.w...@qt.io

You can find the exemplars for well-established languages like “fa” here:

https://github.com/unicode-org/cldr/blob/880c12f2917d79c4aaf52b3665e8da43b87eb35c/common/main/fa.xml
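
A minimal sketch of pulling the exemplar sets out of such a locale file, assuming they live in `exemplarCharacters` elements under `ldml/characters` and that the main set is the element without a `type` attribute. The `SAMPLE_LOCALE_XML` content is a made-up, shortened illustration and not the actual fa.xml data; `exemplar_sets` and the `"main"` key are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened stand-in for a locale file such as fa.xml.
SAMPLE_LOCALE_XML = """\
<ldml>
  <characters>
    <exemplarCharacters>[ا ب پ ت]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[ة]</exemplarCharacters>
  </characters>
</ldml>
"""


def exemplar_sets(xml_text):
    """Return {set type: raw set pattern}, using "main" for the untyped set."""
    root = ET.fromstring(xml_text)
    return {
        el.get("type", "main"): el.text
        for el in root.findall("./characters/exemplarCharacters")
    }


sets = exemplar_sets(SAMPLE_LOCALE_XML)
for kind, pattern in sets.items():
    print(kind, pattern)
```

The values are returned as raw set patterns. Expanding them into individual code points needs a proper set parser (for example from a library such as ICU), which this sketch does not attempt.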
