Hi Cristian,
sorry for the delayed response...
> Still, I found the problem that some languages are supported by CLD2
> but I can't find them in CC (Ewe language is an example that I see
> supported in the CLD2 code [1], Java CLD2 wrapper [2] and CLD2 Debian
> library [3]).
I've tried to identify an Ewe text (a Bible fragment) using both
the Java wrapper and the Python module pycld2: the text is
(mis)identified as Akan (aka). But note that I don't feel
competent to properly evaluate the results.
> Is this assumption correct that languages not used in CLD2 training
> are not supported in CC?
I cannot say anything about CLD2 training. The CLD2 library is mostly
a black box for me: text in, language code out.
> If so, does this come from the CLD2 wrapper, the CLD2 Debian library
> or the CC scripts?
The fact that the Python and the Java wrapper are equally wrong
suggests that Ewe is simply not supported by the underlying CLD2
library, regardless of whether it's built from source or whether the
shared object provided by Debian is used.
> Another difference I observe is that traditional and
> simplified Chinese have been combined under "zho" in the CLD2 Java
> wrapper [6][7], since there is no way to distinguish between the two
> scripts in ISO 639-2 [8].
To keep it simple in the index (and the cc-crawl-statistics are a
summary of the index), all language codes have been normalized to one
of the available 3-letter ISO 639-2 codes. You can find the original
code from CLD2 in the WARC metadata records, e.g.
languages-cld2:
{"reliable":true,"text-bytes":913,"languages":[{"code":"zh-Hant","code-iso-639-3":"zho","text-covered":0.96,"score":1899.0,"name":"ChineseT"}]}
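A minimal sketch of how both codes could be read from such a payload, assuming you have already extracted the JSON value of the languages-cld2 field from the metadata record as a string. The function name summarize_cld2 is made up for illustration; only the field names are taken from the example above.

```python
import json

# Payload copied from the example metadata record above.
payload = (
    '{"reliable":true,"text-bytes":913,"languages":'
    '[{"code":"zh-Hant","code-iso-639-3":"zho","text-covered":0.96,'
    '"score":1899.0,"name":"ChineseT"}]}'
)


def summarize_cld2(payload):
    """Return (original CLD2 code, normalized code, text-covered) per language.

    Hypothetical helper: it only assumes the field names visible in the
    example record.
    """
    record = json.loads(payload)
    return [
        (lang["code"], lang["code-iso-639-3"], lang["text-covered"])
        for lang in record.get("languages", [])
    ]


for original, normalized, covered in summarize_cld2(payload):
    print(original, normalized, covered)
```

For the example record, the original code zh-Hant is still available next to the normalized zho, so the traditional/simplified distinction is not lost in the WARC metadata.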
> Something similar seems to happen with Serbian and Montenegrin, which
> have been assigned the same ISO 639-2 code in the CLD2 Java wrapper
> [9], but I suspect that the reason for this is language proximity,
> since each of these languages has its own ISO 639-2 code (srp and
> cnr, respectively) [10].
This might also be a mistake on my part when assigning the ISO 639-2
codes. Feel free to file an issue for the Java wrapper on GitHub. See
for comparison:
https://github.com/commoncrawl/language-detection-cld2/pull/4
which was fixed starting with
https://commoncrawl.org/2021/03/february-march-2021-crawl-archive-now-available/
I've passed a Montenegrin sample text to the Java wrapper and pycld2
(written in Latin script, see
https://en.wikipedia.org/wiki/Montenegrin_language#Sample_text)
as well as its Cyrillic transliteration:
Сва људска бића рађају се слободна и једнака у достојанству и правима.
Она су обдарена разумом и савјешћу и једни према другима треба да
поступају у духу братства.
The text in Latin script is identified as Bosnian, the Cyrillic one as
Serbian.
> - Detected languages in CC does not detect script, but language.
I think this is true for CLD2 in general, even if the script is used
as an additional (or even discriminating) signal.
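To illustrate that the script is a separate signal from the language, here is a small sketch that only determines the dominant script of a text from Unicode character names. This is not how CLD2 works internally (which I can't speak to); the helper dominant_script is hypothetical, and the Latin string is simply a letter-for-letter transliteration of the first words of the Cyrillic sample.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script among the letters of `text`.

    The script is approximated by the first word of the Unicode character
    name (e.g. "LATIN", "CYRILLIC"), which is a rough heuristic only.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


cyrillic = "Сва људска бића рађају се слободна"
latin = "Sva ljudska bića rađaju se slobodna"

print(dominant_script(cyrillic))
print(dominant_script(latin))
```

Both strings contain the same words, so a script check alone cannot say anything about the language; it can only act as one additional hint.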
Best,
Sebastian
> *From:* common...@googlegroups.com <common...@googlegroups.com> on
> *Sent:* Friday, September 29, 2023 3:53:14 PM
> *To:* Common Crawl <common...@googlegroups.com>
> *Subject:* [cc] Doubt about supported languages in CC
> Hello!
>
> I was wondering why CC supports only 160 languages