• ... the short script names “Jpan” (link) and “Kore” (link) and “Hans” (link) don’t have long script names in the PropertyValueAliases file? How come?
There are two definitions of "script".
One classification is useful for libraries. It gives multi-script writing systems a single code. It may also treat a typographic variant as its own script (cf. Latf).
The other is the classification used by Unicode, where a script is defined by its repertoire, with no repertoires overlapping. (Shared use of characters is handled with the Script_Extensions property.)
The PropertyValueAliases file is properly restricted to values that can be assigned to a given property. "Scripts" that do not fit the definition of the Script property in the Unicode Standard are therefore not covered, and you must look elsewhere for information about them.
A./
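As a rough illustration of the point above, here is a minimal Python sketch that reads "sc" lines in the semicolon-separated layout of PropertyValueAliases.txt and maps short script names to long ones. The SAMPLE text is a small hand-written excerpt in that layout, not a copy of the published file, and the function name script_long_names is made up for this example. A code such as "Jpan" simply has no entry, because it is not a value of the Script property.

```python
# Hand-written excerpt in the layout of PropertyValueAliases.txt (assumption:
# "property ; short name ; long name [; extra aliases]" with "#" comments).
SAMPLE = """\
# Script (sc)
gc ; Lu                               ; Uppercase_Letter
sc ; Arab                             ; Arabic
sc ; Copt                             ; Coptic                           ; Qaac
sc ; Hang                             ; Hangul
sc ; Hani                             ; Han
sc ; Latn                             ; Latin
"""


def script_long_names(text):
    """Return {short script name: long script name} for 'sc' lines."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 3 and fields[0] == "sc":
            names[fields[1]] = fields[2]
    return names


names = script_long_names(SAMPLE)
for code in ("Latn", "Arab", "Jpan"):
    print(code, "->", names.get(code, "(not a Script property value)"))
```

For codes that come back empty here, the ISO 15924 data or CLDR's localized display names are the places to look instead.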
• Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org here.
The UCD predates GitHub and is published as a collection of files on the Unicode server. For consistency and stability this is unlikely to change.
The sources are (non-publicly) managed on GitHub, but the published files are the "truth".
A./
It's where I'd have asked ;^>
> • Is the UCD hosted anywhere? I didn’t find it on Unicode’s Github org
> here.
https://www.unicode.org/ucd/
has the links you probably want.
> • The supplementalData file (...), element languageData seems to
> contain the scripts used for languages, based on the language’s
> ISO639 code (both 2 and 3 letters). The script names seem to be
> “short” versions though (e.g. “Latn” for “Latin”) and the scripts
> section sc in the UCD’s PropertyValueAliases file (...) can be used
> to find the “long” version of the script name?
>
> • Following the previous point, the short script names “Jpan” (...)
> and “Kore” (...) and “Hans” (...) don’t have long script names in
> the PropertyValueAliases file? How come?
>
> • What is the recommended pathway to find a script name(s) for a given
> language code, other than the above?
I just look in common/main/en.xml's localeDisplayNames/scripts/script
elements:
<script type="Hans">Simplified</script>
<script type="Hans" alt="stand-alone">Simplified Han</script>
<script type="Jpan">Japanese</script>
<script type="Kore">Korean</script>
You can surely do the same for any other locale's appropriate name for a
script.
Eddy.
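A minimal Python sketch of that lookup, using only the standard library. The SAMPLE document wraps the four script elements quoted above in the localeDisplayNames/scripts path; the wrapper and the function name script_display_names are assumptions for this example, and a real run would read common/main/en.xml (or another locale's file) instead.

```python
import xml.etree.ElementTree as ET

# The <script> elements quoted above, wrapped in a minimal <ldml> skeleton
# (the skeleton is an assumption made to keep the example self-contained).
SAMPLE = """<ldml><localeDisplayNames><scripts>
<script type="Hans">Simplified</script>
<script type="Hans" alt="stand-alone">Simplified Han</script>
<script type="Jpan">Japanese</script>
<script type="Kore">Korean</script>
</scripts></localeDisplayNames></ldml>"""


def script_display_names(xml_text, alt=None):
    """Map script code -> display name, keeping only entries whose
    'alt' attribute equals `alt` (None selects the default names)."""
    root = ET.fromstring(xml_text)
    names = {}
    for element in root.findall("./localeDisplayNames/scripts/script"):
        if element.get("alt") == alt:
            names[element.get("type")] = element.text
    return names


default_names = script_display_names(SAMPLE)
standalone_names = script_display_names(SAMPLE, alt="stand-alone")
print(default_names)
print(standalone_names)
```

Falling back from the stand-alone name to the default one, when a stand-alone form is wanted but missing, would be a natural next step.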
Adding to what Asmus said: CLDR and Unicode use IDs compatible with ISO 15924. That standard was developed by two different ISO committees, one of which develops standards used by libraries. Unicode is the organization designated to maintain the ISO 15924 code. You can find _those_ data files here:
ISO 15924 Registration Authority (unicode.org)
Peter
--
You received this message because you are subscribed to the Google Groups "CLDR Users Public Mail List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
cldr-users+...@unicode.org.
To view this discussion on the web visit
https://groups.google.com/a/unicode.org/d/msgid/cldr-users/bc388ca5-487e-40ab-8107-7b703131d1a5%40ix.netcom.com.
• The supplementalData file (link), element languageData seems to contain the scripts used for languages, based on the language’s ISO639 code (both 2 and 3 letters). The script names seem to be “short” versions though (e.g. “Latn” for “Latin”) and the scripts section sc in the UCD’s PropertyValueAliases file (link) can be used to find the “long” version of the script name?
• What is the recommended pathway to find a script name(s) for a given language code, other than the above?
Jens,
Earlier in this thread you wrote:
> The problem I want to solve: given an ISO639 language code (e.g. “fa”) find the range of Unicode code points covering the Farsi script.
This assumes that, for a given language, there is a clear and unique mapping to a script. In fact, any language can potentially be written with many different scripts. Now, in practice there are many cases in which a particular language is _commonly_ written only with one script. But there are many languages for which writing in more than one script is not _un_common. Also, even when only one script is used in general contexts, there can be special contexts in which the same language is written using another script. So, for the general case, one needs to map from a language identifier to a set of one or more scripts that are likely to be used in particular contexts.
Once you get from a particular language to a language-script combination, there’s a subsequent possibility that there could be more than one _orthography_ or _writing system_ using that script for that language. E.g., Oxford spelling of English and IPA transcription of English both use Latin script, but the spellings and the repertoire of Latin characters used are very different. Even aside from technical transcription, some languages can have multiple orthographies for common, non-technical usage, and the different orthographies may involve different characters. Again, in the general case, to get from a language ID to a set of Unicode code points, it may be necessary to select from multiple relevant choices that can affect which set of characters is used.
Peter
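To make the one-to-many mapping concrete, here is a minimal Python sketch that collects a set of scripts per language from languageData-style elements. The SAMPLE entries are illustrative values written for this example rather than copied from a CLDR release, the function name scripts_by_language is made up, and the handling of alt="secondary" reflects my reading of how supplementalData.xml marks less common usage.

```python
import xml.etree.ElementTree as ET

# Illustrative languageData entries (assumption: not taken verbatim from CLDR).
SAMPLE = """<supplementalData><languageData>
<language type="fa" scripts="Arab"/>
<language type="sr" scripts="Cyrl Latn"/>
<language type="en" scripts="Latn"/>
<language type="en" scripts="Dsrt Shaw" alt="secondary"/>
</languageData></supplementalData>"""


def scripts_by_language(xml_text):
    """Map language code -> {'primary': [...], 'secondary': [...]}."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.findall("./languageData/language"):
        entry = result.setdefault(
            element.get("type"), {"primary": [], "secondary": []}
        )
        bucket = "secondary" if element.get("alt") == "secondary" else "primary"
        # Some entries may carry no scripts attribute at all; treat as empty.
        for script in element.get("scripts", "").split():
            if script not in entry[bucket]:
                entry[bucket].append(script)
    return result


mapping = scripts_by_language(SAMPLE)
for language, scripts in mapping.items():
    print(language, scripts)
```

Returning lists rather than a single script keeps the caller honest: choosing one script, and then one orthography within it, remains a separate decision.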
From:
Jens Troeger <jens.t...@gmail.com>
Date: Sunday, January 28, 2024 at 3:51 AM
To: CLDR Users Public Mail List <cldr-...@unicode.org>
Cc: marku...@gmail.com <marku...@gmail.com>, CLDR Users Public Mail List <cldr-...@unicode.org>, Jens Troeger <jens.t...@gmail.com>
Subject: Re: A few general questions re CLDR and UCD
> If you mean a long property value name, then there are none defined for scripts that are not used in the UCD.
…and for those scripts that _are_ used in the UCD?
Earlier in this thread you wrote:
> The problem I want to solve: given an ISO639 language code (e.g. “fa”) find the range of Unicode code points covering the Farsi script.
This assumes that, for a given language, there is a clear and unique mapping to a script. ...
You can find the exemplars for well-established languages like “fa” here:
https://github.com/unicode-org/cldr/blob/880c12f2917d79c4aaf52b3665e8da43b87eb35c/common/main/fa.xml
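A minimal Python sketch for pulling the exemplar sets out of such a locale file. The SAMPLE below is a made-up, heavily shortened stand-in and does not reproduce the real contents of fa.xml; the function name exemplar_sets and the "main" key for the untyped element are choices made for this example. The values are returned as raw UnicodeSet strings; expanding them into individual code points needs a UnicodeSet parser (for example from ICU), which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened stand-in for a locale file (assumption: NOT real fa.xml data).
SAMPLE = """<ldml><characters>
<exemplarCharacters>[ا ب پ]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[ة]</exemplarCharacters>
</characters></ldml>"""


def exemplar_sets(xml_text):
    """Map exemplar type -> raw UnicodeSet string ('main' when no type is set)."""
    root = ET.fromstring(xml_text)
    sets = {}
    for element in root.findall("./characters/exemplarCharacters"):
        sets[element.get("type", "main")] = (element.text or "").strip()
    return sets


sets = exemplar_sets(SAMPLE)
for kind, pattern in sets.items():
    print(kind, pattern)
```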