Hi Javier,
> Given these differences (specially dramatic for Norwegian and Swedish), we were wondering whether these
> errors were fixed in later dumps, and if so, if there is a way to know which dump is the first one to
> have these character encoding errors fixed.
The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016
(CC-MAIN-2016-50) due to a bug, see [1]. There should be significantly
fewer errors in this and all later crawls.
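If you want to filter a list of crawls by that cut-off, a minimal sketch could look like the one below. It assumes crawl IDs follow the CC-MAIN-YYYY-WW pattern; the helper names crawl_key and has_charset_fix are made up for illustration and are not part of any Common Crawl tool.

```python
import re

# First crawl with working charset detection, as stated above.
FIRST_FIXED_CRAWL = "CC-MAIN-2016-50"

# Assumption: crawl IDs look like CC-MAIN-<year>-<week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_key(crawl_id):
    """Return (year, week) for an ID such as 'CC-MAIN-2016-50'."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unexpected crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def has_charset_fix(crawl_id):
    """True if the crawl is CC-MAIN-2016-50 or later."""
    return crawl_key(crawl_id) >= crawl_key(FIRST_FIXED_CRAWL)


crawls = ["CC-MAIN-2013-20", "CC-MAIN-2016-44", "CC-MAIN-2016-50"]
usable = [c for c in crawls if has_charset_fix(c)]
```

IDs that don't match the pattern (for example the older ARC-format crawls) raise a ValueError rather than being silently classified.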
Additional notes:
1. since August/September 2018, WARC files and the URL indexes include the identified content language(s). You could simply pick the WARC
records of pages written in your target languages and parse the HTML to extract the textual content. Since the number of records for these
languages is relatively small [2], this approach should be faster. There's an excellent blog post [3] by Athul Jayson, who performed this
task for Malayalam pages. But you might also have a look at [4,5].
2. starting with the May/June 2020 crawl, WET files also include language tags [6]
Given all these bug fixes and improvements, I'd recommend starting with the newest crawls.
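The approach from note 1 could be sketched roughly as follows: filter URL index records by language, then build an HTTP range request for each matching WARC record. This is only a sketch under assumptions. The field names (languages, filename, offset, length), the comma-separated three-letter language codes, and the download prefix are how I'd expect an index record to look, and the sample record is made up, so please check them against real index output.

```python
import json

# Assumed public download prefix; adjust if you read from S3 directly.
BASE_URL = "https://data.commoncrawl.org/"


def matches_language(record, targets):
    """True if any language listed in the record is in `targets`."""
    detected = record.get("languages", "")
    return any(code in targets for code in detected.split(",") if code)


def range_request(record):
    """Build the URL and Range header covering exactly one WARC record."""
    offset = int(record["offset"])
    length = int(record["length"])
    url = BASE_URL + record["filename"]
    # HTTP byte ranges are inclusive, hence the -1.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Hypothetical index line, for illustration only.
sample_line = json.dumps({
    "url": "https://example.no/",
    "filename": "crawl-data/CC-MAIN-2020-24/segments/example/warc/example.warc.gz",
    "offset": "1000",
    "length": "250",
    "languages": "nor,eng",
})

record = json.loads(sample_line)
if matches_language(record, {"nor", "swe"}):
    url, headers = range_request(record)
```

The fetched bytes are a single gzipped WARC record, which you can then decompress and pass to an HTML parser to extract the text.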
> the first ever dump (2013-20)
There's also data in the ARC file format from crawls in 2008 - 2012.
You might also look at some of the other initiatives to build natural language corpora from
Common Crawl, notably [7,8,9].
Best,
Sebastian
[1] https://github.com/commoncrawl/ia-web-commons/issues/4
[2] https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
[3] https://blog.qburst.com/2020/07/extracting-data-from-common-crawl-dataset/
[4] https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives
[5] https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
[6] https://commoncrawl.org/2020/06/may-june-2020-crawl-archive-now-available/
[7] https://oscar-corpus.com/
[8] https://www.earthlings.io/
[9] http://data.statmt.org/cc-100/