ISO 639 URIs

Christian Chiarcos

Jul 7, 2020, 12:40:53 PM
to open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Dear all,

for almost a decade, the Linguistic Linked Open Data community has largely
relied on http://www.lexvo.org/ for providing LOD-compliant language
identifier URIs, esp. with respect to ISO 639-3. Unfortunately, this got
out of sync with the official standard over the years (and when I tried to
confirm this impression by checking one of the more recently created
language tags, csp [Southern Ping Chinese], I found that lexvo was down).

However, even if this is fixed, the synchronization issue will arise
again, and as ISO 639 keeps developing (at a slow pace), I was wondering
whether we should not consider a general shift from lexvo URIs to those
provided by the official registration authorities.

For ISO 639-1 and ISO 639-2, this is the Library of Congress, and they
provide
- a human-readable view: http://id.loc.gov/vocabulary/iso639-2/eng.html,
resp. http://id.loc.gov/vocabulary/iso639-1/en.html (this is actually
machine-readable, too: XHTML+RDFa!),
- a machine-readable view (e.g.,
http://id.loc.gov/vocabulary/iso639-1/en.nt,
http://id.loc.gov/vocabulary/iso639-2/eng.nt), and
- content negotiation (http://id.loc.gov/vocabulary/iso639-2/eng,
http://id.loc.gov/vocabulary/iso639-1/en, working at least for
application/rdf+xml; see the small fetch sketch below)

The problem here is ISO 639-3. The registration authority is SIL and they
provide resolvable URIs, indeed, e.g., http://iso639-3.sil.org/code/eng.
However, this is plain XHTML only, nothing machine-readable (in particular
not the mapping to the other ISO 639 standards). On the positive side,
their URIs seem to be stable, and also to preserve deprecated/retired
codes (https://iso639-3.sil.org/code/dud).

I'm wondering what people think. Basically, I see four alternatives to
Lexvo URIs:
- Work with current SIL URIs, even though these do not provide Linked Data.
- Approach SIL to provide an RDF dump (if not anything more advanced) in
addition to the HTML and TSV editions they currently provide.
- Approach IANA about an RDF edition of the BCP47 subtag registry
(https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)?
This contains a curated subset of ISO language tags and is supposed to be
used in RDF anyway. [This has been suggested before:
https://www.w3.org/wiki/Languages_as_RDF_Resources]
- Approach the Datahub team to provide an RDF view on their CSV collection
of language codes (https://datahub.io/core/language-codes, harvested from
LoC and the IANA subtag registry, but regularly updated)

What would be your preferences? Any other ideas? In any case, if we're
going to reach out to SIL, IANA or Datahub, we should be able to
demonstrate that this is a request from a broader community, because it
would come with some effort for them.

Best,
Christian

NB: Apologies for sending this to multiple mailing lists, but I think we
should work towards a broader consensus for language resources in general
here.

Robert Forkel

Jul 7, 2020, 1:19:20 PM
to Christian Chiarcos, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Just wanted to mention that the URLs of the form
http://iso639-3.sil.org/code/eng are also a fairly recent development,
and - as far as I know - did not come with any commitment of SIL to
keep these stable. But then, they probably carry enough semantics to
serve as a human-resolvable identifier even if they don't resolve for
machines anymore.

best
robert

Robert Forkel

Jul 8, 2020, 1:06:15 AM
to Christian Chiarcos, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
A note on the downloadable data provided by SIL for the ISO-639-3
codes: For quite some time one of the tables in the zip files provided
at https://iso639-3.sil.org/code_tables/639/data was broken (contained
lines with inconsistent numbers of tabs but no content - which is the
reason for this line
https://github.com/clld/clldutils/blob/93d3789175103d6f60eb33ef7f4779177ec9993f/src/clldutils/iso_639_3.py#L52
in my processing code). I notified SIL about this but never got an
answer. Given this, I wouldn't have too high hopes for an RDF dump
provided by SIL.
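
(Just to illustrate the kind of workaround this requires - a rough sketch,
not the actual clldutils code - assuming a tab-separated table with a header
row:)

import csv

def read_iso_table(path):
    # Yield one dict per well-formed row; skip rows whose field count does
    # not match the header instead of failing on the whole dump.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)
        for row in reader:
            if len(row) != len(header) or not any(cell.strip() for cell in row):
                continue
            yield dict(zip(header, row))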

Felix Sasaki

Jul 8, 2020, 5:42:19 AM
to Christian Chiarcos, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Dear Christian and all,

my preference would be "- Approach IANA about an RDF edition of the BCP47 subtag registry ".

Btw., since we had a mail exchange about the topic a while ago, there has been a discussion in the W3C i18n working group.

At the moment that group is working on guidance about language tags and locale identifiers, in which RDF-related guidance would fit very well, see

Best,

Felix

Gilles Sérasset

Jul 8, 2020, 5:48:07 AM
to Christian Chiarcos, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Hi Christian, hi all,

Wouldn’t it be nice if the lexvo.org domain was managed by a group of persons from the LLOD area to provide linked data on the languages that would be an aggregation of all the datasets you mentioned, along with all “sameAs” relations?

I am thinking of a semi-automatic process (a la DBnary) that would update its data from CSVs and other already available linked datasets every month or so and provide an always up-to-date registry.

Moreover, the LoC linked data is quite poor compared to what lexvo had (for instance, the “variants” of the English language name are only available in English, French and German).

This solution would involve a dedicated team of maintainers (in the long run) and a rather small infrastructure to provide the data (which could simply be served from static files + content negotiation). It assumes that the generation of URIs and accompanying data can be made entirely automatically (which may not be the case if there are name clashes among these). It also assumes that the different dataset licences allow for it (which I am unsure about regarding SIL…).
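
(To make the "static files + content negotiation" idea concrete, a very rough
sketch - Flask and the data/ directory layout are assumed purely for
illustration, not a proposal for the actual stack:)

from flask import Flask, request, send_from_directory

app = Flask(__name__)

@app.route("/code/<code>")
def language(code):
    # Serve a pre-generated static file, chosen by the Accept header.
    accept = request.headers.get("Accept", "")
    if "application/rdf+xml" in accept:
        return send_from_directory("data/rdf", f"{code}.rdf", mimetype="application/rdf+xml")
    if "text/turtle" in accept:
        return send_from_directory("data/ttl", f"{code}.ttl", mimetype="text/turtle")
    return send_from_directory("data/html", f"{code}.html", mimetype="text/html")

if __name__ == "__main__":
    app.run()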

I also think that such an alternate dataset may be necessary for other persons who will need to have more information attached to the language they deal with (e.g. date annotations for Historical languages, geographical (space/time) annotation for all languages, etc.).

Regards,

Gilles,

Christian Chiarcos

Jul 8, 2020, 7:24:29 AM
to Gilles Sérasset, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
On 08.07.2020 at 11:46, Gilles Sérasset
<Gilles....@univ-grenoble-alpes.fr> wrote:

> Hi Christian, hi all,
>
> Wouldn’t it be nice if the lexvo.org domain was managed by a group of
> persons from the LLOD area to provide linked data on the languages that
> would be an aggregation of all the datasets you mentioned, along with
> all “sameAs” relations?

Definitely, it might find support in this community (definitely mine), and
as you describe it, it would not even be a big effort to create that. But the
question is how to make that sustainable and to keep it alive (maintained
and funded) in the long run.

> This solution would involve a dedicated team of maintainers (in the long
> run) and a rather small infrastructure to provide the data (which could
> simply be served from static files + content negotiation).

I think it would also require some kind of organizational commitment to
keep it alive on a technical level. This would be one of the strengths of
IANA or (maybe) SIL. There may be other alternatives to these, though.

> It assumes that the generation of URIs and accompanying data can be made
> entirely automatically (which may not be the case if there are name
> clashes among these).

ISO 639 codes should not clash
(https://www.loc.gov/standards/iso639-2/iso639jac.html).

> It also assumes that the different dataset licences allow for it (which
> I am unsure about regarding SIL…).

The terms of use (https://iso639-3.sil.org/code_tables/download_tables)
permit commercial and non-commercial use with attribution and without
modification, but require that "the product, system, or device does not
provide a means to redistribute the code set."

I am not sure what this means. Clearly lexvo and the datahub ISO tables
provide a means to reconstruct the full code set, but apparently that
hasn't been an issue in the last 10 years, also because these are not
verbatim copies.

> I also think that such an alternate dataset may be necessary for other
> persons who will need to have more information attached to the language
> they deal with (e.g. date annotations for Historical languages,
> geographical (space/time) annotation for all languages, etc.).

Absolutely. Glottolog has been a great step in this direction for minority
languages, but for historical languages, nothing comparable exists yet.
But maybe let's separate the discussion of extending ISO 639 data (which
is necessary along many dimensions) from the question of how to create
sustainable identifiers. I could imagine existing organizations taking
care of just providing an RDF view on ISO 639-3 data, but everything
beyond that probably requires external funding (and of course, this is
something we can work towards, too).

Best,
Christian

Christian Chiarcos

Jul 8, 2020, 7:29:46 AM
to Felix Sasaki, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Hi Felix, dear all,

On 08.07.2020 at 10:47, Felix Sasaki <fe...@sasakiatcf.com> wrote:

my preference would be "- Approach IANA about an RDF edition of the BCP47 subtag registry ".

Btw., since we had a mail exchange about the topic a while ago, there has been a discussion in the W3C i18n working group.

At the moment that group is working on guidance about language tags and locale identifiers, in which RDF-related guidance would fit very well, see

On a technical level, this would be my preference, too, mostly because it's consistent with language tag definitions for RDF in general, but also because it is uncertain how much technical support we can expect from SIL. However, two questions:
- As for IANA updates (I see there are some recent ones: https://www.iana.org/assignments/lang-subtags-templates/lang-subtags-templates.xhtml): Is that actively synchronized with ISO 639-3 (etc.) or does it have to be triggered by a change request (and if so, by whom and how)?
- What would be a realistic timeline to expect any results (I mean RDF data) from i18n? Is there any way we can support that?

Best,
Christian

Robert Forkel

Jul 8, 2020, 7:33:47 AM
to Christian Chiarcos, Gilles Sérasset, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Regarding the license terms for the ISO 639-3 code tables: This weird
"the product, system, or device does not provide a means to
redistribute the code set" clause is basically what kept me from
including the ISO code tables in
https://github.com/glottolog/glottolog - although our curation
software downloads and uses these to validate the Glottolog data. If
it were not for this, Glottolog might be a place with some sort of
institutional support that could provide resolvable URLs for all ISO
codes. We are working towards having complete coverage of ISO 639-3 -
even if this might mean "not assessed yet" or "bookkeeping" status for
the associated Glottolog languoid.

santhosh....@gmail.com

Jul 8, 2020, 7:51:22 AM
to open-linguistics
How about Wikidata (https://www.wikidata.org/)? For example, https://www.wikidata.org/wiki/Q36236 is for Malayalam and links to several identifiers.

Christian Chiarcos

Jul 8, 2020, 8:12:07 AM
to santhosh....@gmail.com, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
I think most people would prefer URIs that maintain the ISO acronym, because it is established and because it can still be interpreted if the URLs don't resolve anymore. But the Wikidata entry for Malayalam is interesting for another reason: It refers to the Publications Office of the European Union as the source of its ISO 639 identifiers, and they do indeed seem to provide URIs for some form of ISO 639-based language identifiers, e.g., http://publications.europa.eu/resource/authority/language/AVA for ISO 639 ava. However, they seem to be focusing on ISO 639-2 languages only (e.g., they have neither http://publications.europa.eu/resource/authority/language/CSP for ISO 639-3 csp nor http://publications.europa.eu/resource/authority/language/AV for ISO 639-1 av), so this doesn't seem to be an appealing alternative either.
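
(For reference, a quick way to check which ISO 639 mappings Wikidata itself
records for a language item - a sketch, assuming the Wikidata properties
P218/P219/P220 for ISO 639-1/-2/-3:)

import requests

QUERY = """
SELECT ?prop ?code WHERE {
  VALUES ?prop { wdt:P218 wdt:P219 wdt:P220 }   # ISO 639-1 / 639-2 / 639-3
  wd:Q36236 ?prop ?code .                       # Q36236 = Malayalam
}
"""

resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "iso639-uri-check/0.1"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["prop"]["value"], row["code"]["value"])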

Best,
Christian

On Wed, 8 Jul 2020 at 13:51, <santhosh....@gmail.com> wrote:
How about Wikidata (https://www.wikidata.org/)? For example, https://www.wikidata.org/wiki/Q36236 is for Malayalam and links to several identifiers.


Christian Chiarcos

Aug 7, 2020, 10:27:33 AM
to Felix Sasaki, Ronan Power, santhosh....@gmail.com, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
On 07.08.2020 at 15:30, Ronan Power <Ro...@translation.ie> wrote:

Hi, I wrote on this before to the group:

I think it’s important to realise that ISO639-3 does indeed have its problems (as do the alternatives and variants), not least of which is the “apparent” descriptor<>tag mismatch, and it is confusing.


Yes, I think most linguists who work on non-major languages have encountered such problems (if they tried to make the language explicit, that is). Yet, for the moment, ISO 639 is extremely important in that it is an inventory that is agreed upon. Despite its flaws, reaching agreement on another system would be a massive undertaking, if possible at all. 

This really boils down to the creation and agreement of a source index of identifiers for languages, dialects, written languages and scripts, and to my knowledge no such system has yet been completed thoroughly.


The closest thing to that is Glottolog, and it does a good job on minority languages, but not so much on historical languages.

Actually, one nice thing about BCP47 is that it allows custom language tags, and I recently found myself creating such tags by combining the closest BCP47 language tag with the actual Glottocode, e.g.,

"как бази"@av-x-ancu1238 for the Ancux dialect (Glottocode ancu1238) of Avar (ISO 639-3 ava, ISO 639-1 av)

but 

"кокази"@av for "standard" Avar (Glottocode avar1256, ISO 639-3 ava, ISO 639-1 av)

And for languages for which no ISO 639 code can be found (e.g., Okinawan, because this is not a dialect of Japanese in Glottolog but a sibling language in the Japonic language family), the placeholder tag "mis" (uncoded) can be used, i.e., mis-x-okin1244.

This is nice insofar as this approach makes it possible to provide a BCP47 code for every Glottolog language variety without information loss (because we can retrieve the ISO 639-3 code for every Glottolog languoid by finding the closest parent node that has an ISO 639-3 code attached, and from there, we can find the ISO 639-1 codes using SIL conversion tables or lexvo).

And not only does that approach use conventional BCP47 tags wherever possible, but the custom extension with Glottolog actually yields valid BCP47 tags, too (after -x- you can add whatever you like). Moreover, it is possible to resolve this to a URI (because all Glottocodes resolve, e.g., https://glottolog.org/resource/languoid/id/ancu1238, and we can retrieve the Glottocode for every ISO 639-3 code from Glottolog [and, via SIL conversion tables, from ISO 639-1 codes]).
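
(A minimal sketch of this tagging scheme - the small mapping dict is just an
illustrative stub, not a real ISO/Glottolog conversion table:)

GLOTTOCODE_TO_BCP47 = {      # illustrative only
    "ancu1238": "av",        # Ancux dialect of Avar -> closest ISO 639-1 tag
    "okin1244": "mis",       # Okinawan: no ISO 639 code, so 'mis' (uncoded)
}

def bcp47_with_glottocode(glottocode):
    # Build a valid BCP47 tag with the Glottocode in the -x- private-use part.
    base = GLOTTOCODE_TO_BCP47.get(glottocode, "mis")
    return f"{base}-x-{glottocode}"

def glottolog_uri(tag):
    # Recover the Glottolog languoid URI from such a tag, if it carries a Glottocode.
    if "-x-" in tag:
        return "https://glottolog.org/resource/languoid/id/" + tag.split("-x-")[-1]
    return None

assert bcp47_with_glottocode("ancu1238") == "av-x-ancu1238"
assert glottolog_uri("mis-x-okin1244").endswith("okin1244")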

The remaining difficulties are that
(a) Glottolog is far from perfect either [for historical languages, the Glottolog classification tries to harmonize diachronic and synchronic relations, and this does not always lead to a consistent result],
(b) this is a hack rather than a solution, because there is no formal way to assert that the elements following -x- are Glottocodes in BCP47, and
(c) if we know that something is a Glottocode, we can *reconstruct* its URI and browse the Glottolog classification, but there is no good way to make this information explicit.

A nicer solution, therefore, would be to just link an entry with the URIs for ISO language code, Glottocode and other metadata, so that all information is explicit ;) Having standard URIs for ISO 639 tags would be the first step.

Best,
Christian

Ronan Power

Aug 11, 2020, 7:50:25 AM
to Felix Sasaki, Christian Chiarcos, santhosh....@gmail.com, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org

Hi, I wrote on this before to the group:

I think it’s important to realise that ISO639-3 does indeed have its problems (as do the alternatives and variants), not least of which is the “apparent” descriptor<>tag mismatch, and it is confusing.

I have adopted ISO639-3 previously; however, I was forced to adopt a hybrid version including all available language tags from all systems in an application we were building, and we allowed for different “PROPER-NOUNS” in “any language” to be added as an altTag for any of those languages.

However, I think it is important to realise that 639-3 does by far the better job of having the widest scope of languages, but in my opinion we are dealing with “spoken languages” here, even though many languages are accurately represented as written languages too. I sense some confusion between this and the similar language mappings widely used in HTML/XML, by some of the larger multinational localisation vendors, and by the “localisation industry status quo” in general, such as (lang_country) mappings like (ES_MX) or (EN_GB, EN_US) etc. In addition, in my opinion, another principal issue here is the point of view of the culture defining the standard, i.e. a westernised, English-speaking point of view, which of course is based on assumptions about ISO text mappings and character sets. E.g., where does something like “written Traditional Chinese vs. Simplified Chinese” fit into any of the systems referred to above?

 

This really boils down to the creation and agreement of a source index of identifiers for languages, dialects, written languages and scripts, and to my knowledge no such system has yet been completed thoroughly.

 

 

That’s my 2 cents of opinion. Please feel free to reach out to me if you feel strongly about this, and I apologise if I have offended anybody.

 

Kind regards

Ronan

 

From: Felix Sasaki [mailto:fe...@sasakiatcf.com]
Sent: Friday 7 August 2020 12:32
To: Christian Chiarcos <christian...@web.de>
Cc: santhosh....@gmail.com; open-linguistics <open-lin...@googlegroups.com>; Linked Data for Language Technology Community Group <public...@w3.org>; public-...@w3.org
Subject: Re: [open-linguistics] Re: ISO 639 URIs

 

Dear Christian and all,

 

FYI and in case you have further comments, I brought this thread to the attention of the W3C i18n working group, see this issue

also, W3C has started work again on a draft about "language tags and locale identifiers", see the editors copy here

that version also contains some guidance about working with language tags in the context of RDF, see

 

Feel free to provide feedback here or within the W3C GitHub, we'd be more than happy to take this into account.

 

Best,


Felix

Felix Sasaki

Aug 11, 2020, 7:50:26 AM
to Christian Chiarcos, santhosh....@gmail.com, open-linguistics, Linked Data for Language Technology Community Group, public-...@w3.org
Dear Christian and all,

FYI and in case you have further comments, I brought this thread to the attention of the W3C i18n working group, see this issue
also, W3C has started work again on a draft about "language tags and locale identifiers", see the editors copy here
that version also contains some guidance about working with language tags in the context of RDF, see
 
Feel free to provide feedback here or within the W3C GitHub, we'd be more than happy to take this into account.

Best,

Felix

On Wed, 8 Jul 2020 at 14:11, Christian Chiarcos <christian...@web.de> wrote: