Download files for specific language


John Walter

Feb 11, 2023, 4:40:01 AM
to Common Crawl
Hello CC Team.

Could you please tell me how I can get the WET files for a specific language?

For example, I want all WET files for Estonian ("languages": "est").

Thank you.

Sebastian Nagel

Feb 14, 2023, 9:36:39 AM
to common...@googlegroups.com
Hi John,

WET files include the plain text from HTML pages contained in the
corresponding WARC files. Web pages are randomly distributed over WARC
files, so the same holds for the WETs: no WET file contains only pages
in a single language.

> Please tell me, how can I get WET files for some specific language?
>
> For example I want all WET files for language Estonian ("languages":
> "est").

If you plan to process all WET files anyway, e.g. because you're also
interested in the content of other languages:
- since May 2020 the WET files include the content language(s) detected
by CLD2 in the record header, e.g.
WARC-Identified-Content-Language: est,eng
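
As a rough illustration of filtering on that header, here is a minimal
sketch using only the Python standard library. It assumes the usual WARC
record layout (a "WARC/1.0" version line, header lines, a blank line, then
Content-Length bytes of payload) and that page text is stored in records of
type "conversion". The function names are made up for this example; a
dedicated WARC library is probably the better choice for real processing.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC/WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line[:40]!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        payload = stream.read(int(headers.get("content-length", 0)))
        yield headers, payload


def records_in_language(stream, lang):
    """Yield (url, text) for text records whose detected languages include `lang`."""
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "conversion":
            continue  # skip the warcinfo record at the start of the file
        langs = headers.get("warc-identified-content-language", "").split(",")
        if lang in langs:
            yield headers.get("warc-target-uri"), payload.decode("utf-8", "replace")


def filter_wet_file(path, lang):
    """Convenience wrapper for a local, gzip-compressed WET file."""
    with gzip.open(path, "rb") as stream:
        yield from records_in_language(stream, lang)
```

Calling filter_wet_file("some-file.warc.wet.gz", "est") on a downloaded WET
file would then yield only the records tagged as Estonian, including those
tagged with several languages such as "est,eng".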

Otherwise, there are two options:

1. You could use the URL index to get the offsets of pages with Estonian
content, fetch the WARC record and parse the HTML content. For details, see
https://groups.google.com/g/common-crawl/c/tAO6VaAw3WA/m/Untz_qt6EwAJ

2. There are corpora derived from Common Crawl which are partitioned by
language, e.g.
https://oscar-corpus.com/
https://www.earthlings.io/
http://data.statmt.org/cc-100/
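
For option 1, the fetch step can be sketched roughly as follows. This
assumes each index record is a JSON object with "filename", "offset" and
"length" fields (the names used in the URL index output; the columnar index
uses different column names), and that the files are served from
https://data.commoncrawl.org/. The helper names are made up for this
example, and how you select the Estonian records from the index is left out.

```python
import gzip
import json
import urllib.request

# Assumed public download prefix for Common Crawl data files.
DATA_PREFIX = "https://data.commoncrawl.org/"


def range_request_for(index_line):
    """Turn one JSON index record into (url, headers) for an HTTP range request."""
    record = json.loads(index_line)
    offset = int(record["offset"])
    length = int(record["length"])
    # HTTP byte ranges are inclusive on both ends.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return DATA_PREFIX + record["filename"], headers


def fetch_warc_record(index_line):
    """Download and decompress a single WARC record (performs a network request)."""
    url, headers = range_request_for(index_line)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        compressed = response.read()
    # Each record is stored as its own gzip member, so the slice
    # can be decompressed independently of the rest of the file.
    return gzip.decompress(compressed)
```

The returned bytes contain the WARC headers, the HTTP response headers and
the HTML body, so the plain text still has to be extracted from the HTML.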


Best,
Sebastian