Hi John,
WET files include the plain text extracted from the HTML pages contained
in the corresponding WARC files. Web pages are distributed randomly over
the WARC files, and the same holds for the WET files, so there are no WET
files dedicated to a single language.
> Please tell me, how can i get WET files for some specific language?
>
> For example i want all WET files for language Estonian ("languages":
> "est).
If you plan to process all WET files anyway, e.g. because you're also
interested in the content of other languages:
- since May 2020, the WET files include the content language(s) detected
by CLD2 in the WARC header of each record, e.g.
WARC-Identified-Content-Language: est,eng
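A minimal sketch of how such a filter could look, using only the Python
standard library. The function names are hypothetical, and the reader
assumes well-formed records that each carry a Content-Length header; a
dedicated WARC library would be more robust in practice.

```python
import gzip
import io


def has_language(header_value, lang):
    """Return True if `lang` is listed in a header value such as "est,eng"."""
    return lang in [code.strip() for code in header_value.split(",")]


def iter_wet_records(stream):
    """Yield (headers, text) for each record of an uncompressed WET stream.

    Assumption: every record starts with a "WARC/..." version line and
    has a Content-Length header giving the payload size in bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines separating records
        headers = {}
        while True:
            line = stream.readline().decode("utf-8").rstrip("\r\n")
            if not line:
                break  # an empty line ends the header block
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, body.decode("utf-8", errors="replace")


def records_in_language(stream, lang):
    """Yield (url, text) for records whose detected languages include `lang`."""
    for headers, text in iter_wet_records(stream):
        value = headers.get("WARC-Identified-Content-Language", "")
        if has_language(value, lang):
            yield headers.get("WARC-Target-URI"), text
```

It could be used on a downloaded WET file like this (the file name is a
placeholder):

```python
with gzip.open("example.warc.wet.gz", "rb") as stream:
    for url, text in records_in_language(stream, "est"):
        print(url, len(text))
```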
Otherwise, there are two options:
1. You could use the URL index to get the offsets of pages with Estonian
content, fetch the WARC record and parse the HTML content. For details, see
https://groups.google.com/g/common-crawl/c/tAO6VaAw3WA/m/Untz_qt6EwAJ
2. There are corpora derived from Common Crawl which are partitioned by
language, e.g.
https://oscar-corpus.com/
https://www.earthlings.io/
http://data.statmt.org/cc-100/
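For option 1, fetching a single WARC record comes down to an HTTP range
request with the filename, offset and length taken from the URL index.
Below is a rough sketch under a few assumptions: the base URL
https://data.commoncrawl.org/ is where the files are served, each record
is its own gzip member, and the function names are made up for
illustration.

```python
import gzip
import urllib.request

# Assumption: public HTTP endpoint serving the crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Build the Range header value covering `length` bytes from `offset`."""
    return "bytes=%d-%d" % (offset, offset + length - 1)


def fetch_warc_record(filename, offset, length):
    """Download and decompress one WARC record located via the URL index."""
    request = urllib.request.Request(
        BASE_URL + filename, headers={"Range": byte_range(offset, length)}
    )
    with urllib.request.urlopen(request) as response:
        return gzip.decompress(response.read())


def split_record(raw):
    """Split a response record into WARC headers, HTTP headers and payload."""
    warc_headers, http_headers, body = raw.split(b"\r\n\r\n", 2)
    return warc_headers, http_headers, body
```

The payload returned by split_record is the raw HTML, which still needs
to be decoded and passed to an HTML parser of your choice to extract the
text.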
Best,
Sebastian