charset field in URL index

Henry S. Thompson

Nov 30, 2022, 5:12:41 AM
to Common Crawl
Is this taken from the Content-Type response header if it's given
there, or is it always the charset-detected reported in the metadata
WARC record, or ???

Thanks,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Henry S. Thompson

Nov 30, 2022, 5:29:14 AM
to common...@googlegroups.com
Henry S. Thompson writes:

> Is this taken from the Content-Type response header if it's given
> there, or is it always the charset-detected reported in the metadata
> WARC record, or ???

Hmm, answering my own question, and asking another: it appears from
inspection that it's always the charset-detected value as reported in
the metadata WARC record, even if that differs from what is given in
the Content-Type response header. For example, we have

cn,024h3)/?tag=www.7798827.com 20190817234630 {"url": "http://024h3.cn/?tag=WWW.7798827.COM", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DXZ7RIGR2DFTBVB6LDCNUJDNH4INLNQI", "length": "72754", "offset": "49134", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313501.0/warc/CC-MAIN-20190817222907-20190818004907-00010.warc.gz", "charset": "GBK", "languages": "zho,eng"}

in CC-MAIN-2019-35/cdx/warc/cdx-00017.gz, and in the WARC file itself
we find

In the response record:
  Content-Type: text/html; charset=gb2312
In the metadata record:
  charset-detected: GBK

So, further question: where is the charset-detected coming from?
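
For anyone wanting to repeat this comparison, here is a minimal Python sketch. It assumes an index line is "SURT key, timestamp, JSON" separated by single spaces, as in the example above. The index line is abbreviated from that example, and the helper names parse_cdxj_line and header_charset are made up for illustration.

```python
import json
from email.message import Message


def parse_cdxj_line(line):
    """Split a URL index line into SURT key, timestamp and JSON fields."""
    surt_key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased by the stdlib


# Abbreviated copy of the index line quoted above (some fields dropped).
line = ('cn,024h3)/?tag=www.7798827.com 20190817234630 '
        '{"url": "http://024h3.cn/?tag=WWW.7798827.COM", "mime": "text/html", '
        '"mime-detected": "text/html", "status": "200", "charset": "GBK", '
        '"languages": "zho,eng"}')

_, _, fields = parse_cdxj_line(line)
declared = header_charset("text/html; charset=gb2312")
detected = fields.get("charset")
print(declared, detected, declared.lower() == detected.lower())
# prints: gb2312 GBK False
```

Only the declared and detected names are compared here, case-insensitively; whether two differently named charsets should count as compatible is left open.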

Sebastian Nagel

Nov 30, 2022, 5:31:48 AM
to common...@googlegroups.com
Hi Henry,

the value is the same as in the WARC metadata record.

The charset is detected using the HTTP Content-Type header, HTML
metadata and the content bytes.

Apache Tika encoding detectors are used: a composite detector [1]
combining HtmlEncodingDetector [2] and Icu4jEncodingDetector [3].
You'll find the implementation in [4].

Best,
Sebastian

[1] https://tika.apache.org/2.6.0/api/org/apache/tika/detect/CompositeEncodingDetector.html
[2] https://tika.apache.org/2.6.0/api/org/apache/tika/parser/html/HtmlEncodingDetector.html
[3] https://tika.apache.org/2.6.0/api/org/apache/tika/parser/txt/Icu4jEncodingDetector.html
[4] https://github.com/commoncrawl/nutch/blob/cc/src/java/org/commoncrawl/util/LanguageDetector.java
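
To illustrate the general idea of combining several charset sources, here is a rough Python sketch. It is not the Tika or Nutch code: the detector order, the 8192-byte window, the regular expression and the UTF-8-only byte check are all assumptions made for the example, and every function name is hypothetical.

```python
import re
from email.message import Message

# Assumed pattern: a charset declared somewhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def from_meta(body, content_type):
    """Look for a charset declared in an HTML <meta> tag near the start."""
    match = META_CHARSET.search(body[:8192])
    return match.group(1).decode("ascii").lower() if match else None


def from_header(body, content_type):
    """Take the charset parameter of the HTTP Content-Type header, if any."""
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"
    return msg.get_content_charset()


def from_bytes(body, content_type):
    """Crude stand-in for a statistical detector: accept UTF-8 if it decodes."""
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


def detect_charset(body, content_type,
                   detectors=(from_meta, from_header, from_bytes)):
    """Return the first non-empty guess from the chain of detectors."""
    for detector in detectors:
        guess = detector(body, content_type)
        if guess:
            return guess
    return None


print(detect_charset(b"<p>hi</p>", "text/html; charset=gb2312"))
# prints: gb2312
```

A real statistical detector scores many candidate encodings against the bytes; the from_bytes stand-in only keeps the example self-contained.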

Henry S. Thompson

Nov 30, 2022, 6:47:26 PM
to common...@googlegroups.com
Sebastian Nagel writes:

> the value is the same as in the WARC metadata record.
>
> The charset is detected using the HTTP Content-Type header, HTML
> metadata and the content bytes.
>
> Apache Tika encoding detectors are used: ...

Thanks, very helpful. One small further clarification: it appears to
me, based on a quick check, that only index lines for responses with
one of the following two values for 'mime-detected' have a 'charset',
is that right?

text/html
application/xhtml+xml

Seems slightly surprising that e.g. text/plain or
application/(...+)xml or other text/... cases aren't covered, although
they represent at least 1% of the number of successful responses...
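
A quick check of this kind can be sketched in Python as below. The sketch assumes the same "SURT key, timestamp, JSON" line layout as the index line quoted earlier in the thread. The three sample lines are invented for illustration, and charset_coverage is a hypothetical helper name.

```python
import json
from collections import Counter


def charset_coverage(lines):
    """Count, per mime-detected value, how many records carry a charset.

    Returns {mime: (records_with_charset, total_records)}.
    """
    with_charset, total = Counter(), Counter()
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not have the expected layout
        fields = json.loads(parts[2])
        mime = fields.get("mime-detected", "unknown")
        total[mime] += 1
        if "charset" in fields:
            with_charset[mime] += 1
    return {mime: (with_charset[mime], total[mime]) for mime in total}


# Invented sample lines, not taken from a real index file.
sample = [
    'org,example)/ 20190817000000 {"mime-detected": "text/html", '
    '"status": "200", "charset": "UTF-8"}',
    'org,example)/a.txt 20190817000001 {"mime-detected": "text/plain", '
    '"status": "200"}',
    'org,example)/b 20190817000002 {"mime-detected": "application/xhtml+xml", '
    '"status": "200", "charset": "UTF-8"}',
]
coverage = charset_coverage(sample)
for mime, (n_charset, n_total) in sorted(coverage.items()):
    print(mime, n_charset, n_total)
```

For a real index file, the lines could be read with gzip.open(path, "rt", encoding="utf-8") and passed to the same function.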

Sebastian Nagel

Dec 1, 2022, 4:14:04 AM
to common...@googlegroups.com
Hi Henry,

yes, so far encoding and language detection are only done for HTML
documents.

> Seems slightly surprising that e.g. text/plain or
> application/(...+)xml or other text/... cases aren't covered, although
> they represent at least 1% of the number of successful responses...

In recent crawls, plain text and XML formats together make up about
0.5% of all crawled documents, see the second table on [1].

Of course, it would be nice to extend encoding and language detection
to all kinds of documents (where applicable). But, as with many
potential improvements, this was just never implemented.

Best,
Sebastian