Henry S. Thompson writes:
> Is this taken from the Content-Type response header if it's given
> there, or is it always the charset-detected reported in the metadata
> WARC record, or ???
Hmm, answering my own question, and asking another: It appears from
inspection that it's always the charset-detected as reported in the
metadata WARC record, even if that differs from what is given in the
Content-Type response header. For example, we have
cn,024h3)/?tag=www.7798827.com 20190817234630 {"url": "http://024h3.cn/?tag=WWW.7798827.COM", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DXZ7RIGR2DFTBVB6LDCNUJDNH4INLNQI", "length": "72754", "offset": "49134", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027313501.0/warc/CC-MAIN-20190817222907-20190818004907-00010.warc.gz", "charset": "GBK", "languages": "zho,eng"}
in CC-MAIN-2019-35/cdx/warc/cdx-00017.gz, and in the WARC file itself
we find
In the response record:
  Content-Type: text/html; charset=gb2312
In the metadata record:
  charset-detected: GBK
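
For anyone who wants to repeat the comparison on other index lines, here is a minimal sketch in Python using only the standard library. It assumes the index line has the shape shown above (key, timestamp, then a JSON object, separated by single spaces); the function names parse_cdxj_line and header_charset are my own, not part of any Common Crawl tooling, and the Content-Type value is simply pasted in by hand from the response record rather than read from the WARC file.

```python
import json
from email.message import Message


def parse_cdxj_line(line):
    """Split an index line into (key, timestamp, fields-as-dict)."""
    surt_key, timestamp, payload = line.split(" ", 2)
    return surt_key, timestamp, json.loads(payload)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# The index line quoted above, rebuilt as one string.
cdx_line = (
    'cn,024h3)/?tag=www.7798827.com 20190817234630 '
    '{"url": "http://024h3.cn/?tag=WWW.7798827.COM", "mime": "text/html", '
    '"mime-detected": "text/html", "status": "200", '
    '"digest": "DXZ7RIGR2DFTBVB6LDCNUJDNH4INLNQI", "length": "72754", '
    '"offset": "49134", "filename": "crawl-data/CC-MAIN-2019-35/segments/'
    '1566027313501.0/warc/CC-MAIN-20190817222907-20190818004907-00010.warc.gz", '
    '"charset": "GBK", "languages": "zho,eng"}'
)

surt_key, timestamp, fields = parse_cdxj_line(cdx_line)

# Copied by hand from the response record in the WARC file.
declared = header_charset("text/html; charset=gb2312")
indexed = fields["charset"].lower()

print("index charset: ", indexed)
print("header charset:", declared)
print("agree:", indexed == declared)
```

This compares the labels as plain strings only; it makes no attempt to decide whether two differently named encodings should count as equivalent.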
So, further question: where is the charset-detected coming from?