Doubt about supported languages in CC

Cristian García Romero

Sep 29, 2023, 3:53:14 PM

to Common Crawl

Hello!

I was wondering why CC supports only 160 languages (https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html) while CLD2 supports up to 174 languages (https://github.com/CLD2Owners/cld2/wiki/January-2014-Release-Notes or https://github.com/CLD2Owners/cld2/blob/master/internal/generated_language.h).

I'd be very grateful if you could clarify this.

Thank you in advance!

Sebastian Nagel

Sep 30, 2023, 1:22:05 AM

to common...@googlegroups.com

Hi Cristian,

the 2014 release notes state that "174 language-script combinations" are
supported. There are languages which can be written in multiple scripts.
For example, CLD2 is able to detect Uzbek written using Arabic, Cyrillic
or Latin letters. See line 251 and following in one of the unit test
files [1]. Looking at the comments in the unit tests, it's also clear
that the CLD2 identifier was not trained for all of the defined languages.

Btw., CC uses the Debian / Ubuntu package of CLD2 [2,3] (preloading
the library supporting 160 languages). The Java wrapper is available
at [4].

Best,
Sebastian

[1]
https://github.com/CLD2Owners/cld2/blob/master/internal/cld2_unittest_full.cc#L251C1-L251C1
[2] https://packages.debian.org/bookworm/libcld2-0
[3] https://packages.ubuntu.com/source/jammy/cld2
[4] https://github.com/commoncrawl/language-detection-cld2
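
To illustrate the difference between counting language-script combinations and counting languages, here is a minimal Python sketch. The list of pairs is a small hand-picked sample (only the three Uzbek scripts come from this thread; the other entries are assumptions for illustration), not CLD2's actual table.

```python
from collections import defaultdict

# Illustrative sample of (language, script) pairs; NOT CLD2's real table.
# Only the three Uzbek entries are taken from the thread above.
LANGUAGE_SCRIPT_PAIRS = [
    ("uz", "Arab"),
    ("uz", "Cyrl"),
    ("uz", "Latn"),
    ("en", "Latn"),
    ("ru", "Cyrl"),
]


def scripts_by_language(pairs):
    """Group script codes under their language code."""
    grouped = defaultdict(set)
    for language, script in pairs:
        grouped[language].add(script)
    return dict(grouped)


grouped = scripts_by_language(LANGUAGE_SCRIPT_PAIRS)
print(len(LANGUAGE_SCRIPT_PAIRS), "language-script combinations")
print(len(grouped), "distinct languages")
```

With this sample, five combinations collapse into three languages, which is the same effect that makes 174 combinations correspond to fewer languages.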


Katia Billeci

Sep 30, 2023, 5:11:27 PM

to common...@googlegroups.com

Unsubscribe


Cristian García Romero

Oct 2, 2023, 6:57:16 AM

to Common Crawl

Hi, Sebastian,

Thanks for the clarification! Now I think I understand the numbers better (my mind skipped the word "script" -.-).

Still, I found the problem that some languages are supported by CLD2 but I can't find them in CC (Ewe language is an example that I see supported in the CLD2 code [1], Java CLD2 wrapper [2] and CLD2 Debian library [3]). However, the languages I've found in this situation (ewe, gaa, kri, loz, lua, luo, new, oss, pam, raj, tum and twi) are exactly the languages that were not used in the CLD2 training process in any script [5]. Is this assumption correct that languages not used in CLD2 training are not supported in CC? If so, does this come from the CLD2 wrapper, the CLD2 Debian library or the CC scripts?

Another difference I observe is that traditional and simplified Chinese have been combined under "zho" in the CLD2 Java wrapper [6][7], since there is no way to distinguish between the two scripts in ISO 639-2 [8]. Something similar seems to happen with Serbian and Montenegrin, which have been assigned the same ISO 639-2 code in the CLD2 Java wrapper [9], but I suspect that the reason for this is language proximity, since each of these languages has its own ISO 639-2 code (srp and cnr, respectively) [10].

Then:

- Of the initial 174 language-script combinations supported by CLD2, a total of 163 distinct languages are available according to [4].

- Languages not seen in training are not included in this evaluation.

- If we combine traditional with simplified Chinese and Serbian with Montenegrin, we have a total of 161 languages + "unknown".

- According to [7], 161 languages + "unknown" are supported by CC.

- Language detection in CC reports the language, not the script.

I think the number now makes sense with the clarification of the language scripts. I'd appreciate it if you could tell me if I said something wrong.

Thank you!

Best,

Cristian.

[1] https://github.com/CLD2Owners/cld2/blob/b56fa78a2fe44ac2851bae5bf4f4693a0644da7b/internal/generated_language.h#L198

[2] https://github.com/commoncrawl/language-detection-cld2/blob/296f71bfd17c27db166053bb911fcb06ca999528/src/main/java/org/commoncrawl/langdetect/cld2/Language.java#L204

[3] http://deb.debian.org/debian/pool/main/c/cld2/cld2_0.0.0-git20150806.orig.tar.gz

[4] https://github.com/CLD2Owners/cld2/blob/master/docs/evaluate_cld2_large_20140122.txt

[5] https://github.com/CLD2Owners/cld2/blob/b56fa78a2fe44ac2851bae5bf4f4693a0644da7b/internal/cld2_unittest_full.cc#L122

[6] https://github.com/commoncrawl/language-detection-cld2/blob/296f71bfd17c27db166053bb911fcb06ca999528/src/main/java/org/commoncrawl/langdetect/cld2/Language.java#L107

[7] https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html

[8] https://chinese.stackexchange.com/questions/6147/which-one-of-these-two-iso-639-2-code-refers-to-traditional-chinese-chi-or-zho

[9] https://github.com/commoncrawl/language-detection-cld2/blob/296f71bfd17c27db166053bb911fcb06ca999528/src/main/java/org/commoncrawl/langdetect/cld2/Language.java#L198

[10] https://www.loc.gov/standards/iso639-2/php/code_list.php
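
The merging described above can be sketched as a lookup from CLD2 codes to 3-letter codes. The mapping below is a hypothetical four-entry excerpt written for illustration; the actual assignments live in `Language.java` of the Java wrapper [2] and may differ, and the CLD2 code used here for Montenegrin is an assumption.

```python
# Hypothetical excerpt of a CLD2-code -> 3-letter-code mapping.
CLD2_TO_ISO = {
    "zh": "zho",       # Chinese (Simplified)
    "zh-Hant": "zho",  # Chinese (Traditional)
    "sr": "srp",       # Serbian
    "sr-ME": "srp",    # Montenegrin; this CLD2 code is an assumption
}


def normalize(cld2_codes, mapping=CLD2_TO_ISO):
    """Map CLD2 codes to 3-letter codes, keeping unmapped codes unchanged."""
    return {mapping.get(code, code) for code in cld2_codes}


detected = ["zh", "zh-Hant", "sr", "sr-ME"]
normalized = normalize(detected)
print(len(detected), "CLD2 codes ->", len(normalized), "normalized codes")
print(sorted(normalized))
```

Each pair that shares a target code reduces the language count by one, which is how 163 languages become 161 after the two merges.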

Sebastian Nagel

Oct 8, 2023, 3:56:19 PM

to common...@googlegroups.com

Hi Cristian,

sorry for the delayed response...

> Still, I found the problem that some languages are supported by CLD2
> but I can't find them in CC (Ewe language is an example that I see
> supported in the CLD2 code [1], Java CLD2 wrapper [2] and CLD2 Debian
> library [3]).

I've tried to identify an Ewe text (a bible fragment) using both
the Java wrapper and the Python module pycld2: the text is
(mis)identified as Akan (aka). But note that I definitely don't feel
competent to really evaluate the results.

> Is this assumption correct that languages not used in CLD2 training
> are not supported in CC?

I cannot say anything about CLD2 training. The CLD2 library is mostly
a black box for me: text in, language code out.

> If so, does this come from the CLD2 wrapper, the CLD2 Debian library
> or the CC scripts?

The fact that the Python and the Java wrappers are equally wrong
suggests that Ewe is simply not supported by the underlying CLD2
library, regardless of whether it's built from source or whether the
shared object provided by Debian is used.

> Another difference I observe is that traditional and
> simplified Chinese have been combined under "zho" in the CLD2 Java
> wrapper [6][7], since there is no way to distinguish between the two
> scripts in ISO 639-2 [8].

To keep things simple in the index (the cc-crawl-statistics are a
summary of the index), all language codes have been normalized to one
of the available 3-letter ISO 639-2 codes. You can find the original
code from CLD2 in the WARC metadata records, e.g.

languages-cld2:
{"reliable":true,"text-bytes":913,"languages":[{"code":"zh-Hant","code-iso-639-3":"zho","text-covered":0.96,"score":1899.0,"name":"ChineseT"}]}
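
As a minimal sketch, the value above can be read with Python's standard `json` module to recover both the original CLD2 code and the normalized one. The JSON string is copied from the example; how the `languages-cld2` value is extracted from a WARC metadata record is left out here.

```python
import json

# Value of the "languages-cld2" field, copied from the example above.
raw = (
    '{"reliable":true,"text-bytes":913,"languages":[{"code":"zh-Hant",'
    '"code-iso-639-3":"zho","text-covered":0.96,"score":1899.0,'
    '"name":"ChineseT"}]}'
)

record = json.loads(raw)
for lang in record["languages"]:
    # "code" is the original CLD2 code, "code-iso-639-3" the normalized one.
    print(lang["name"], lang["code"], "->", lang["code-iso-639-3"])
```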

> Something similar seems to happen with Serbian and Montenegrin, which
> have been assigned the same ISO 639-2 code in the CLD2 Java wrapper
> [9], but I suspect that the reason for this is language proximity,
> since each of these languages has its own ISO 639-2 code (srp and
> cnr, respectively) [10].

This might also be a mistake of mine when assigning the ISO 639-2 codes.
Feel free to file an issue for the Java wrapper on GitHub. See for
comparison:
https://github.com/commoncrawl/language-detection-cld2/pull/4
which was fixed starting with
https://commoncrawl.org/2021/03/february-march-2021-crawl-archive-now-available/

I've passed a Montenegrin sample text to the Java wrapper and pycld2,
both written in Latin script (see
https://en.wikipedia.org/wiki/Montenegrin_language#Sample_text)
and in Cyrillic transliteration:
Сва људска бића рађају се слободна и једнака у достојанству и правима.
Она су обдарена разумом и савјешћу и једни према другима треба да
поступају у духу братства.
The text in Latin script is identified as Bosnian, the Cyrillic one as
Serbian.

> - Detected languages in CC does not detect script, but language.

I think this is true for CLD2 in general, even if the script is used
as an additional (or even discriminatory) signal.
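
To illustrate what "script as a signal" can mean, here is a rough Python sketch that guesses the dominant script of a text from Unicode character names. This is not how CLD2 works internally; it is a simplification that assumes the first word of each letter's Unicode name (e.g. LATIN, CYRILLIC) identifies its script. The Latin sentence is a transliteration of the Cyrillic one, added for this sketch.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script among the letters of `text`.

    Uses the first word of each letter's Unicode name as a stand-in
    for its script; returns None if the text contains no letters.
    """
    counts = Counter()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


cyrillic = "Сва људска бића рађају се слободна и једнака у достојанству и правима."
# Latin transliteration of the same sentence (added for this sketch).
latin = "Sva ljudska bića rađaju se slobodna i jednaka u dostojanstvu i pravima."

print(dominant_script(cyrillic))
print(dominant_script(latin))
```

A detector that sees the same sentence in two scripts can use this kind of signal to prefer different candidate languages, while still reporting only a language code.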

Best,
Sebastian


Cristian García Romero

Oct 9, 2023, 6:37:00 AM

to Common Crawl

Hi, Sebastian,

Thank you for the detailed answers! They helped me understand the situation better!

About the Montenegrin language, I'm not sure it is a mistake. I've been checking the CLD2 code and there are multiple hints suggesting that Montenegrin is intentionally detected as Serbian:

- `public/compact_lang_det.h`: // MONTENEGRIN is not detected as such, but likely scores as Serbian.

- `internal/cld2_unittest_full.cc`: // Not trained {MONTENEGRIN, kTeststr_sr_ME_Latn}, // Not recognized as distinct from Croatian/Serbian

- `internal/compact_lang_det_hint_code.cc`: // BOSNIAN CROATIAN MONTENEGRIN SERBIAN detecting just CROATIAN SERBIAN

- `internal/compact_lang_det_impl.cc`: // use SERBO_CROATIAN instead of BOSNIAN, SERBIAN, CROATIAN, MONTENEGRIN(21)

- https://github.com/CLD2Owners/cld2/wiki/October-2014-Release-Notes: "There is very little text for some languages (URLs invited): [...] Montenegrin (Latn and Cyrl scripts), [...]."

Best,

Cristian.

Tom Morris

Oct 9, 2023, 11:30:31 PM

to common...@googlegroups.com

I agree that Montenegrin is intentionally excluded.

The accuracy table [1] that was posted earlier is the key thing to
look at to understand what languages are supported by CLD2 (and how
accurate it is for each). There are 163 languages supported, although
two of those are Klingon and Pig Latin, which I wouldn't count.
CommonCrawl also combines Simplified Chinese & Traditional Chinese, as
was pointed out earlier, which brings the number down to 160. Ewe is
not listed in the table.

Uzbek is supported in three scripts and nine other languages are
supported with two scripts each, but, as was noted, the script isn't
returned. Not all languages which use multiple writing systems have
them all supported. For example, only the Gurmukhi-based version of
Punjabi is supported by CLD2, not the version using the Shahmukhi
writing system.

As an aside, it might be worth looking at the fastText langid model,
which is much more accurate than CLD2 (98+% vs 87%) as noted in this
benchmark [2] (which, unfortunately, used the smaller 83-language
model for CLD2, not the bigger model).

Best,
Tom

[1] https://github.com/CLD2Owners/cld2/blob/master/docs/evaluate_cld2_large_20140122.txt
[2] https://modelpredict.com/language-identification-survey#conclusion
