Hi Henry,
Could you share some details about what exactly is missing?
I've just looked into the CDX and Parquet indexes of CC-MAIN-2022-33,
and the fields "languages" and "encoding" (CDX) and the columns
"content_languages" and "content_encoding" (Parquet) are there.
See one example below.
Please note that only successful fetches of HTML pages are passed to
the charset and language detectors. Redirects, 404s, robots.txt
captures, PDFs, etc. either lack these fields or contain null values.
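As a rough sketch of how a reader of the CDX index could cope with this,
the Python snippet below treats a missing field and a null value the same
way. The function name detected_fields and the redirect record are made up
for illustration, the HTML record is a trimmed copy of the example at the
end of this message, and the comma-separated form of "languages" is an
assumption:

```python
import json

# Trimmed copy of the CDX record quoted in this message (only the fields used here).
html_line = (
    '{"url": "https://commoncrawl.org/", "mime-detected": "text/html", '
    '"status": "200", "languages": "eng", "encoding": "UTF-8"}'
)

# Hypothetical redirect capture, invented for illustration: no detector fields.
redirect_line = '{"url": "http://example.com/", "status": "301"}'


def detected_fields(line):
    """Return (languages, encoding) for one CDX JSON record.

    Absent fields and null values are both reported as None.
    Assumption: several detected languages are separated by commas.
    """
    record = json.loads(line)
    languages = record.get("languages")
    encoding = record.get("encoding")
    return (languages.split(",") if languages else None, encoding)


print(detected_fields(html_line))
print(detected_fields(redirect_line))
```

Using dict.get() rather than indexing means records without the fields do
not raise a KeyError, so the same code can walk over all captures.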
Best,
Sebastian
{"urlkey": "org,commoncrawl)/", "timestamp": "20220809171502",
 "url": "https://commoncrawl.org/", "mime": "text/html",
 "mime-detected": "text/html", "status": "200",
 "digest": "754LZOH7HB2U44JPENPXJ4OEBIGUBKOC", "length": "6815",
 "offset": "193897242",
 "filename": "crawl-data/CC-MAIN-2022-33/segments/1659882571056.58/warc/CC-MAIN-20220809155137-20220809185137-00589.warc.gz",
 "languages": "eng", "encoding": "UTF-8"}