Character encoding errors in old dumps

Javier de la Rosa

Mar 5, 2021, 6:21:35 AM
to Common Crawl
Hi,

At the National Library of Norway we are training language models for the Nordic languages and need vast amounts of textual data, so we decided to complement our own resources with Common Crawl data. We started processing the first ever dump (2013-20) using Google Dataflow, but the results contain plenty of character encoding errors that are not fixable: the original byte information is lost, so no matter which encoding we assume, the result is always the same (see attached example). This matters greatly for a language like Norwegian, where non-ASCII characters are pervasive, but it's also the case for Icelandic, Danish, and Swedish.
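
To make the failure mode concrete, here is a minimal sketch of the kind of heuristic filter involved (illustrative, not our exact code; the pattern only covers the lowercase Norwegian letters, and a real filter needs a longer list):

    import re

    # U+FFFD is the replacement character left behind once the original
    # bytes are gone; "Ã¦", "Ã¸", "Ã¥" are the UTF-8-read-as-Latin-1
    # renderings of the Norwegian letters æ, ø, å.
    MOJIBAKE = re.compile("\ufffd|Ã[¦¸¥]")

    def has_encoding_errors(text: str) -> bool:
        """Heuristically flag text with unrecoverable encoding damage."""
        return bool(MOJIBAKE.search(text))

    print(has_encoding_errors("GrÃ¸nland"))  # True: mangled "Grønland"
    print(has_encoding_errors("Grønland"))   # False: clean text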

After removing near duplicates, we compared the size of text per language before and after removing documents with character encoding errors:
- Danish: 3.8G before vs. 3.1G after
- Icelandic: 365M vs. 264M
- Norwegian: 3.6G vs. 2.9G
- Swedish: 9.3G vs. 8.0G

Given these differences (especially dramatic for Norwegian and Swedish), we were wondering whether these errors were fixed in later dumps and, if so, whether there is a way to know which dump is the first one to have them fixed.

Cheers.
Attachment: sample.txt

Sebastian Nagel

Mar 5, 2021, 8:07:05 AM
to common...@googlegroups.com
Hi Javier,

> Given these differences (especially dramatic for Norwegian and Swedish), we were wondering whether these
> errors were fixed in later dumps and, if so, whether there is a way to know which dump is the first one
> to have them fixed.

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016
(CC-MAIN-2016-50) due to a bug, see [1]. There should be significantly
fewer errors in that crawl and all later ones.
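
If you want to re-extract text yourself from the WARC files of an
affected crawl, the charset-aware decoding step is the essential part.
Here is a minimal sketch in Python using warcio and Beautiful Soup's
UnicodeDammit (our WET converter is the Java code in ia-web-commons [1];
this only illustrates the same idea):

    from warcio.archiveiterator import ArchiveIterator
    from bs4 import UnicodeDammit

    def iter_decoded_html(warc_path):
        """Yield (url, unicode_html), detecting the charset per record
        instead of blindly assuming UTF-8."""
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                url = record.rec_headers.get_header("WARC-Target-URI")
                raw = record.content_stream().read()
                # UnicodeDammit checks BOMs and <meta> charset declarations
                # first, then falls back to statistical detection.
                dammit = UnicodeDammit(raw)
                if dammit.unicode_markup:
                    yield url, dammit.unicode_markup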

Additional notes:

1. Since August/September 2018, WARC files and the URL indexes include the identified content language(s). You could simply pick the WARC
records of pages written in your target languages and parse the HTML to extract the textual content. Since the number of records for these
languages is relatively small [2], this approach should be faster (see the sketch after this list). There's an excellent blog post [3] by
Athul Jayson who performed this task for Malayalam pages. But you might also have a look at [4,5].

2. Starting with the May/June 2020 crawl [6], WET files also include language tags.
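
As a minimal sketch of the approach in note 1, here is how to query the
CDX index API and fetch single records with HTTP range requests (for
bulk work the columnar index [4] with SQL is the better route; the crawl
ID, the example domain, and the client-side language check below are all
illustrative):

    import io
    import json
    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Query the CDX API of one crawl; the 'languages' field exists
    # in the indexes since August/September 2018.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2020-50-index",
        params={"url": "nb.no/*", "output": "json", "limit": "20"},
    )
    for line in resp.text.splitlines():
        rec = json.loads(line)
        if "nor" not in rec.get("languages", ""):  # keep Norwegian pages
            continue
        # Fetch just this record from its WARC file via a range request.
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        raw = requests.get(
            "https://commoncrawl.s3.amazonaws.com/" + rec["filename"],
            headers={"Range": "bytes=%d-%d" % (start, end)},
        ).content
        for record in ArchiveIterator(io.BytesIO(raw)):
            print(record.rec_headers.get_header("WARC-Target-URI"))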

Given all these bug fixes and improvements, I'd recommend starting with the newest crawls first.

> the first ever dump (2013-20)

There's also data in the ARC file format from crawls in 2008 - 2012.
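
If you process those, note that warcio can read the old ARC format
through the same API (a small sketch, assuming a local ARC file):

    from warcio.archiveiterator import ArchiveIterator

    # arc2warc=True presents ARC records through the WARC record API.
    with open("example.arc.gz", "rb") as stream:
        for record in ArchiveIterator(stream, arc2warc=True):
            print(record.rec_headers.get_header("WARC-Target-URI"))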

You might also look at some of the other initiatives to build natural language corpora from
Common Crawl, notably [7,8,9].

Best,
Sebastian


[1] https://github.com/commoncrawl/ia-web-commons/issues/4
[2] https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
[3] https://blog.qburst.com/2020/07/extracting-data-from-common-crawl-dataset/
[4] https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives
[5] https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
[6] https://commoncrawl.org/2020/06/may-june-2020-crawl-archive-now-available/
[7] https://oscar-corpus.com/
[8] https://www.earthlings.io/
[9] http://data.statmt.org/cc-100/

Javier de la Rosa

Mar 5, 2021, 8:11:37 AM
to Common Crawl
Hi Sebastian,

Thank you so much for the super informative response. I'll read all the resources you link to and make a decision about the best way to build our corpus.

Cheers.

Sebastian Nagel

Mar 5, 2021, 8:22:09 AM
to common...@googlegroups.com
Hi Javier,

> make a decision about the best way to build our corpus

Great! - and feel free to come back with any further questions.

Best,
Sebastian
