charset and languages fields missing in URL index for 2022-33?

Henry S. Thompson

Jan 6, 2023, 1:21:22 PM
to common...@googlegroups.com
Subject says it all...

Maybe no-one but me cares, as I can't find any discussion of this on
this list...

The fields are present in 2022-21 and 2022-40.

Happy New Year,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Sebastian Nagel

Jan 10, 2023, 6:55:08 AM
to common...@googlegroups.com
Hi Henry,

could you share some details about what exactly is missing?

I've just looked into the CDX and Parquet indexes of CC-MAIN-2022-33,
and the CDX fields "languages" and "encoding" and the corresponding
Parquet columns "content_languages" and "content_encoding" are there.
See one example below.

Please note that only successfully fetched HTML pages are passed to
the charset and language detectors. Redirects, 404s, robots.txt
captures, PDFs, etc. lack these fields (CDX) or contain null values
(Parquet).

Best,
Sebastian

{"urlkey": "org,commoncrawl)/", "timestamp": "20220809171502", "url":
"https://commoncrawl.org/", "mime": "text/html", "mime-detected":
"text/html", "status": "200", "digest":
"754LZOH7HB2U44JPENPXJ4OEBIGUBKOC", "length": "6815", "offset":
"193897242", "filename":
"crawl-data/CC-MAIN-2022-33/segments/1659882571056.58/warc/CC-MAIN-20220809155137-20220809185137-00589.warc.gz",
"languages": "eng", "encoding": "UTF-8"}

Henry S. Thompson

Jan 10, 2023, 8:13:46 AM
to common...@googlegroups.com
Sebastian Nagel writes:

> could you share some details about what exactly is missing?
>
> I've just looked into the CDX and Parquet indexes of CC-MAIN-2022-33,
> and the CDX fields "languages" and "encoding" and the corresponding
> Parquet columns "content_languages" and "content_encoding" are there.
> See one example below.
>
> Please note that only successfully fetched HTML pages are passed to
> the charset and language detectors. Redirects, 404s, robots.txt
> captures, PDFs, etc. lack these fields (CDX) or contain null values
> (Parquet).

Right, my bad: I was just checking the first line of a random index
file for each of the archives I have indices for, and in that case got
a diagnostics line. Apologies for the noise.
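
For the record, a sturdier spot-check is to look at a successful HTML
capture rather than whatever record happens to come first (a sketch;
the index filename is arbitrary):

$ gzcat cdx-00000.gz | grep '"status": "200"' | grep '"mime": "text/html"' | head -1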

Sebastian Nagel

Jan 10, 2023, 8:59:52 AM
to common...@googlegroups.com
Hi Henry,

no problem. I agree that it's not ideal to have null values encoded by
absence. On the other hand, this is not something I want to change, as
it would make the CDX index results less readable and also blow up the
index, e.g. if a redirect-target field were added to all records.
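
On the consumer side, the absence is easy to normalize, e.g. with jq's
// alternative operator (a sketch, assuming the usual three-column CDX
line layout of SURT key, timestamp, JSON):

$ gzcat cdx-00000.gz | cut -d' ' -f3- | jq -r '[.languages // "-", .url] | @tsv' | head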

Best,
Sebastian

r vn alst

Feb 20, 2023, 4:39:03 PM
to Common Crawl
Hi, 

maybe I'm doing something wrong, but I download and unzip cdx-00238,
read all lines, test for domain "nl", status 200, and URL not ending
in robots.txt, then test for the existence of a "languages" field in
the dictionary. None of the records have that field, although there
are about 6 million candidates.
I also tried "content_languages", with the same result.
In which type of file are these fields expected: the WARC files, the
cdx-xxxxx files, or only cluster.idx?
And about naming (maybe I goofed there): are the cdx files called index
files, or is cluster.idx the only file that is called an index?

Thanks for educating another newbie ;-)

PS: my code does not use the columnar index, just the regular one.

Regards,
Ronald van Aalst

Tom Morris

Feb 20, 2023, 6:50:09 PM
to common...@googlegroups.com
On Mon, Feb 20, 2023 at 4:39 PM r vn alst <raa...@gmail.com> wrote:

> maybe I'm doing something wrong, but I download and unzip cdx-00238,
> read all lines, test for domain "nl", status 200, and URL not ending
> in robots.txt, then test for the existence of a "languages" field in
> the dictionary. None of the records have that field, although there
> are about 6 million candidates.

It would be easier to help if you fully specified which files you're
using and showed the code that isn't working.

I just did a quick check of the index in the latest crawl, and the
`languages` field is definitely included. Of the 8.8M .nl URL records
in the index file that I looked at, only ~160K were missing the
`languages` field.

$ curl -o - https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00238.gz | gunzip | grep ^nl, | grep '"status": "200"' | cut -d ' ' -f 3- | jq -r "[.languages,.url]|@tsv" | grep -v -E  "robots.txt$" | gzip > url-languages.tsv.gz
$ gzcat url-languages.tsv.gz | wc -l
 8766956
$ gzcat url-languages.tsv.gz | cut -f 1 | sort | uniq -c | sort -r -n | head
3921897 nld
2939907 nld,eng
588619 eng
466235 eng,nld
160003
50517 nld,afr
38601 ara,eng
32805 nld,fra
30571 nld,eng,afr
22331 nld,eng,fra

If you are starting with a different file, you can use these commands
(or combine them into a single one-liner) to test whether your file has
the same contents. It takes less than 5 minutes on my laptop and
residential internet connection.
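
A quicker sanity check on any single index file is to count how many
records carry the field at all (same three-column line layout assumed):

$ gzcat cdx-00238.gz | cut -d ' ' -f 3- | jq -r 'has("languages")' | sort | uniq -c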

Tom

p.s. To get all the records for the .nl TLD, you'll need to process 5 index files:

$ curl -o -  https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cluster.idx | grep -E ^nl, | cut -f 2 | sort -u
cdx-00238.gz
cdx-00239.gz
cdx-00240.gz
cdx-00241.gz
cdx-00242.gz
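
A loop along these lines (untested sketch) will fetch all five:

$ for i in $(seq 238 242); do
>   curl -O https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00$i.gz
> done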



Henry S. Thompson

Feb 21, 2023, 10:18:56 AM
to common...@googlegroups.com
Tom Morris writes:

> On Mon, Feb 20, 2023 at 4:39 PM r vn alst <raa...@gmail.com> wrote:
>
> maybe I'm doing something wrong, but I download and unzip cdx-00238,
> ...

Note that the languages and charset fields are only present in crawls
since CC-MAIN-2018-34.
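
That's easy to confirm against the index server: a record from an older
collection, e.g. CC-MAIN-2018-26, comes back without the "languages"
and "encoding" keys (assuming the server still exposes that collection):

$ curl -s 'https://index.commoncrawl.org/CC-MAIN-2018-26-index?url=commoncrawl.org/&output=json&limit=1'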

r vn alst

Feb 22, 2023, 11:30:34 AM
to Common Crawl
Thanks for the help, and especially the command lines.
I was able to reproduce these results, so I can now find out what went
wrong with my code.